PREFACE

This is a mathematical textbook rather than a compendium of computa- 
tional rules. It is hoped that the material included will provide a useful 
background for those seeking to devise and evaluate routines for numerical 
computation. 

The general topics considered are the solution of finite systems of 
equations, linear and nonlinear, and the approximate representation of 
functions. Conspicuously omitted are functional equations of all types. 
The justification for this omission lies first in the background presupposed 
on the part of the reader. Second, there are good books, in print and in 
preparation, on differential and on integral equations. But ultimately, 
the numerical "solution" of a functional equation consists of a finite table 
of numbers, whether these be a set of functional values, or the first n coeffi- 
cients of an expansion in terms of known functions. Hence, eventually 
the problem must be reduced to that of determining a finite set of numbers 
and of representing functions thereby, and at this stage the topics in this 
book become relevant. 

The endeavor has been to keep the discussion within the reach of 
one who has had a course in calculus, though some elementary notions 
of the probability theory are utilized in the allusions to statistical assess- 
ments of errors, and in the brief outline of the Monte Carlo method. The 
book is an expansion of lecture notes for a course given in Oak Ridge for 
the University of Tennessee during the spring and summer quarters of 
1950. 

The material was assembled with high-speed digital computation
always in mind, though many techniques appropriate only to "hand" 
computation are discussed. By a curious and amusing paradox, the 
advent of high-speed machinery has lent popularity to the two innovations 
from the field of statistics referred to above. How otherwise the con- 
tinued use of these machines will transform the computer's art remains to 
be seen. But this much can surely be said, that their effective use 
demands a more profound understanding of the mathematics of the 
problem, and a more detailed acquaintance with the potential sources of 
error, than is ever required by a computation whose development can be 
watched, step by step, as it proceeds. It is for this reason that a text- 
book on the mathematics of computation seems in order. 

Help and encouragement have come from too many to permit listing
all by name. But it is a pleasure to thank, in particular, J. A. Cooley,
C. C. Hurd, D. A. Flanders, J. W. Givens, A. de la Garza, and members of
the Mathematics Panel of Oak Ridge National Laboratory. And for the 
painstaking preparation of the copy, thanks go to Iris Tropp, Gwen 
Wicker, and above all, to Mae Gill. 

A. S. Householder

CHAPTER 1
THE ART OF COMPUTATION 



1. The Art of Computation 

We are concerned here with mathematical principles that are some- 
times of assistance in the design of computational routines. It is hardly 
necessary to remark that the advent of high-speed sequenced computing 
machinery is revolutionizing the art and that it is much more difficult to 
explain to a machine how a problem is to be done than to explain to most 
human beings. Or that the process that is easiest for the human being to 
carry out is not necessarily the one that is easiest or quickest for the 
machine. Not only that, but a process may be admirably well adapted 
to one machine and very poorly adapted to another. Consequently, the 
robot master has very few tried and true rules at his disposal, and is 
forced to go back to first principles to construct such rules as seem to 
conform best to the idiosyncrasy of his particular robot. 

If a computation requires more than a very few operations, there are 
usually many different possible routines for achieving the same end 
result. Even so simple a computation as ab/c can be done (ab)/c, (a/c)b,
or a(b/c), not to mention the possibility of reversing the order of the
factors in the multiplication. Mathematically these are all equivalent; 
computationally they are not (cf. 1.2 and 1.4). Various, and some- 
times conflicting, criteria must be applied in the final selection of a par- 
ticular routine. If the routine must be given to someone else, or to a 
computing machine, it is desirable to have a routine in which the steps 
are easily laid out, and this is a serious and important consideration 
in the use of sequenced computing machines. Naturally one would 
like the routine to be as short as possible, to be self-checking as far as 
possible, to give results that are at least as accurate as may be required. 
And with reference to the last point, one would like the routine to be 
such that it is possible to assert with confidence (better yet, with cer- 
tainty) and in advance that the results will be as accurate as may be 
desired, or if an advance assessment is out of the question, as it often is, 
one would hope that it can be made at least upon completion of the 
computation. 

1.1. Errors and Blunders. The number 0.33, when expressing the
result of the division 1 ÷ 3, is correctly obtained even though it deviates
by 1 per cent from the true quotient. The number 0.334, when expressing
the result of the same division, deviates by only 0.2 per cent from the
true quotient, and yet is incorrectly obtained. The deviation of 0.33 
from the true quotient will be called an error. If the division is to be 
carried out to three places but not more, then 0.333 is the best representa- 
tion possible and the replacement of the final "3" by a final "4" will be 
called a blunder. 
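
The percentages quoted above can be checked with exact rational arithmetic. The following is a minimal sketch in Python; the helper name `relative_deviation` is an assumption introduced here for illustration and is not part of the text.

```python
from fractions import Fraction

def relative_deviation(approx: str, true_value: Fraction) -> Fraction:
    """Return |approx - true| / |true| as an exact fraction."""
    return abs(Fraction(approx) - true_value) / abs(true_value)

true_quotient = Fraction(1, 3)
for digits in ("0.33", "0.333", "0.334"):
    deviation = relative_deviation(digits, true_quotient)
    # 0.33 is a correctly obtained two-place result (an error);
    # 0.334 is closer than 0.33 but is not what three-place division yields (a blunder).
    print(digits, float(100 * deviation), "per cent")
```

The smaller deviation of 0.334 does not make it correct: the distinction between error and blunder rests on how the number was obtained, not on its size.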

Blunders result from fallibility, errors from finitude. Blunders will 
not be considered here to any extent. There are fairly obvious ways to 
guard against them, and their effect, when they occur, can be gross, 
insignificant, or anywhere in between. Generally the sources of error 
other than blunders will leave a limited range of uncertainty, and gen- 
erally this can be reduced, if necessary, by additional labor. It is impor- 
tant to be able to estimate the extent of the range of uncertainty. 

Four sources of error are distinguished by von Neumann and Goldstine, 
and while occasionally the errors of one type or another may be negligible 
or absent, generally they are present. These sources are the following: 

1. Mathematical formulations are seldom exactly descriptive of any 
real situation, but only of more or less idealized models. Perfect gases 
and material points do not exist. 

2. Most mathematical formulations contain parameters, such as 
lengths, times, masses, temperatures, etc., whose values can be had only 
from measurement. Such measurements may be accurate to within 1, 
0.1, or 0.01 per cent, or better, but however small the limit of error, it is 
not zero. 

3. Many mathematical equations have solutions that can be con- 
structed only in the sense that an infinite process can be described whose 
limit is the solution in question. By definition the infinite process can- 
not be completed, so one must stop with some term in the sequence, 
accepting this as the adequate approximation to the required solution. 
This results in a type of error called the truncation error. 

4. The decimal representation of a number is made by writing a 
sequence of digits to the left, and one to the right, of an origin which is 
marked by the decimal point. The digits to the left of the decimal are 
finite in number and are understood to represent coefficients of increasing 
powers of 10 beginning with the zeroth; those to the right are possibly 
infinite in number, and represent coefficients of decreasing powers of 10. 
In digital computation only a finite number of these digits can be taken 
account of. The error due to dropping the others is called the round-off 
error. 

In decimal representation 10 is called the base of the representation. 
Many modern computing machines operate in the binary system, using 
the base 2 instead of the base 10. Every digit in the two sequences is
either 0 or 1, and the point which marks the origin is called the binary
point, rather than the decimal point. Desk computing machines which 
use the base 8 are on the market, since conversion between the bases 2 
and 8 is very simple. Colloquial languages carry the vestiges of the use 
of other bases, e.g., 12, 20, 60, and in principle, any base could be used. 

Clearly one does not evaluate the error arising from any one of these 
sources, for if he did, it would no longer be a source of error. Generally 
it cannot be evaluated. In particular cases it can be evaluated but not 
represented (e.g., in the division 1 ÷ 3 carried out to a preassigned
number of places). But one does hope to set bounds for the errors and 
to ascertain that the errors will not exceed these bounds. 

The computer is not responsible for sources 1 and 2. He is not 
concerned with formulating or assessing a physical law nor with making 
physical measurements. Nevertheless, the range of uncertainty to 
which they give rise will, on the one hand, create a limit below which the 
range of uncertainty of the results of a computation cannot come, and 
on the other hand, provide a range of tolerance below which it does not 
need to come. 

With the above classification of sources, we present a classification 
of errors as such. This is to some extent artificial, since errors arising 
from the various sources interact in a complex fashion and result in a 
single error which is no simple sum of elementary errors. Nevertheless, 
thanks to a most fortunate circumstance, it is generally possible to 
estimate an over-all range of uncertainty as though it were such a simple 
sum (1.2). Hence we will distinguish propagated error, generated 
error, and residual error. 

At the outset of any computation the data may contain errors of 
measurement, round-off errors due to the finite representation in some base 
of numbers like 1/3, or even numbers requiring a finite but large number
of places for exact representation. These initial errors carry through the 
computation and lead to an uncertainty at every step. It is important to 
know how these initial errors are propagated through the computation 
and to what extent they render the results uncertain. 

In addition to this, at every step, or nearly every step, new errors
may arise as a result of round-off; these combine with the errors already
propagated, and the total is propagated through what computations 
remain. Finally, when the computation is terminated, a truncation 
error may remain and further enlarge the region of uncertainty. Roughly 
the extent to which errors are propagated and the uncertainty due to 
residual error depend upon the mathematical formulation of the compu- 
tational procedure, while the generation is more dependent upon the 
detailed ordering of the computational steps. 

Any computation, however elaborate, consists of a finite number of
elementary operations carried out in some sequence. The elementary
operations are usually additions and subtractions, multiplications and 
divisions, comparisons, possibly table look-ups, and the like. An unam- 
biguous description of the sequence in which the operations are performed 
or to be performed, with a specification of the data upon which each 
is to operate, constitutes a routine. If multiplication and division are 
elementary operations, there are six possible routines for computing 
ab/c. Hence a routine is by no means defined when a mathematical 
formula, or sequence of them, is written down. 
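
That different routines for ab/c need not agree can be seen on a toy machine. The sketch below uses Python's `decimal` module with an assumed precision of three significant digits; the function name `groupings` and the sample operands are illustrative choices, not taken from the text. Only the three groupings are shown, since reversing the factors of a rounded product does not change it.

```python
from decimal import Decimal, localcontext

def groupings(a, b, c, digits=3):
    """Evaluate ab/c in three groupings, rounding after every operation."""
    with localcontext() as ctx:
        ctx.prec = digits          # every arithmetic result is rounded to `digits` figures
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        return {
            "(ab)/c": (a * b) / c,
            "(a/c)b": (a / c) * b,
            "a(b/c)": a * (b / c),
        }

for name, value in groupings(2, 7, 3).items():
    print(name, value)
```

With these operands the first two routines give 4.67 and the third gives 4.66, although all three are the same formula mathematically.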

A routine of any complexity breaks up naturally into parts or sub- 
routines. A subroutine may have for its purpose the computation of an 
intermediate quantity, of no interest in itself but serving as a datum or 
operand for one or more subsequent subroutines. Thus, in order to 
calculate ab/c, one must first calculate ab, or a/c, or b/c. Or a subroutine 
may operate upon intermediate results to produce a final result. 

Suppose a subroutine is intended to compute a function f(x, y, . . .) 
for given values of its arguments. If $f$ is a rational function or a polynomial,
there need be no residual error in the computation, but only
propagated and generated errors. If $f$ is not a rational function, some
rational approximation must be devised. One type of rational approxi- 
mation is a Taylor series. If a Taylor series is used, only a finite number 
of terms will be computed. The residual error in the computation is the 
sum of all neglected terms. Hence the residual error is fixed by the 
mathematical formulation of the problem, together with the specification 
of the number of terms to be used in the computation. But an error 
may be generated and propagated in each computed term. 

Another type of approximation, e.g., in solving an equation by Newton's
method, is the following: From an initial approximation $f_0$, one
defines a sequence of approximations by a relation $f_{i+1} = \phi(f_i)$, where $\phi$
is some function which can be evaluated. If the mathematical sequence
converges, one may take as a criterion for terminating the sequence the
condition that successive terms shall differ by less than some assigned
quantity. Clearly the assigned quantity must not be smaller than the
smallest quantity representable by the machine, and normally it will be
some integral multiple of this. It will have to depend upon the error
generated by the routine for evaluating $\phi$. The residual error is the
difference between the $f_i$ which one accepts and the true value of $f$, and
it is limited by the quantity used in the criterion which is, in turn, limited
by the precision of the machine operations. For illustration, a routine
for computing square roots will be considered in 1.5.
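
The termination criterion just described can be sketched as follows. This is an illustrative Python fragment, not the routine of 1.5: the name `iterate_until_stable`, the tolerance, the step limit, and the use of ordinary floating-point numbers are all assumptions made for the example.

```python
def iterate_until_stable(phi, f0, tolerance, max_steps=100):
    """Apply f_{i+1} = phi(f_i) until successive terms differ by less than `tolerance`."""
    f = f0
    for _ in range(max_steps):
        f_next = phi(f)
        if abs(f_next - f) < tolerance:   # the assigned quantity of the criterion
            return f_next
        f = f_next
    raise RuntimeError("no convergence within max_steps")

# Newton's method for the equation f*f = 2, written as a fixed-point relation.
root = iterate_until_stable(lambda f: f - (f * f - 2.0) / (2.0 * f), 1.0, 1e-12)
print(root)
```

If the tolerance were set below what the arithmetic can resolve, successive terms might never satisfy the criterion, which is why a step limit is included here.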

1.2. Composition of Error. Let x*, y*, . . . designate numbers which 
might occur as data or as results of a particular computation. That is 
to say, $x^*$ has a form

(1.2.1)    $x^* = \pm(x_1\beta^{-1} + x_2\beta^{-2} + \cdots + x_\lambda\beta^{-\lambda})\,\beta^{\sigma}$,

where $\beta$ is the base, usually 2 or 10, $\lambda$ is a positive integer, and $\sigma$ any
integer, possibly zero. It may be that $\lambda$ is fixed throughout the course
of the computation, or it may vary, but in any case it is limited by practical
considerations. Such a number will be called a representation.
It may be that $x^*$ is obtained by "rounding off" a number whose true
value is $x$ (for example, $x = 1/3$, $x^* = 0.33$), or that $x^*$ is the result of
measuring physically a quantity whose true value is $x$, or that $x^*$ is the
result of a computation intended to give an approximation to the quantity
$x$.

Suppose one is interested in performing an operation $\omega$ upon a pair
of numbers $x$ and $y$. That is to say, $x \omega y$ may represent a product of $x$
and $y$, a quotient of $x$ by $y$, the $y$th power of $x$, . . . . In the numerical
computation, however, one has only $x^*$ and $y^*$ upon which to operate,
not $x$ and $y$ (or at least these are the quantities upon which one does, in
fact, operate). Not only this, but often one does not even perform the
strict operation $\omega$, but rather a pseudo operation $\omega^*$, which yields a
rounded-off product, quotient, power, etc. Hence, instead of obtaining
the desired result $x \omega y$, one obtains a result $x^* \omega^* y^*$.

The error in the result is therefore 

(1.2.2)    $x \omega y - x^* \omega^* y^* = (x \omega y - x^* \omega y^*) + (x^* \omega y^* - x^* \omega^* y^*)$.

Since $x^*$ and $y^*$ are numbers, the operation $\omega$ can be applied to them, and
$x^* \omega y^*$ is well defined, except for special cases as when $\omega$ represents
division and $y^* = 0$. But the expression in the first parentheses on the
right represents propagated error, and that in the second parentheses
represents generated error, or round-off. Hence the total error in the
result is the sum of the error propagated by the operation and that gen- 
erated by the operation. 

It may happen that the two errors are opposite in sign, though often 
the sign is not known, but only the magnitude. In any case, 

(1.2.3)    $|x \omega y - x^* \omega^* y^*| \le |x \omega y - x^* \omega y^*| + |x^* \omega y^* - x^* \omega^* y^*|$,

and one can say at least that the total error does not exceed the sum of 
the two errors. 
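
The splitting of the total error into a propagated part and a generated part can be reproduced exactly with rational arithmetic. In the sketch below the operation is multiplication and the pseudo operation is multiplication followed by rounding to a fixed number of decimal places; the helper names `round_to_places` and `error_split` and the choice of three places are assumptions made for illustration.

```python
from fractions import Fraction

def round_to_places(value: Fraction, places: int) -> Fraction:
    """Round an exact rational to the given number of decimal places."""
    scale = 10 ** places
    return Fraction(round(value * scale), scale)

def error_split(x: Fraction, y: Fraction, places: int):
    """Return (propagated, generated, total) error for a rounded product."""
    x_star = round_to_places(x, places)          # data as actually held
    y_star = round_to_places(y, places)
    exact_on_rounded = x_star * y_star           # strict operation on x*, y*
    pseudo = round_to_places(exact_on_rounded, places)  # pseudo operation
    propagated = x * y - exact_on_rounded
    generated = exact_on_rounded - pseudo
    total = x * y - pseudo
    return propagated, generated, total

propagated, generated, total = error_split(Fraction(1, 3), Fraction(2, 3), 3)
print(float(propagated), float(generated), float(total))
```

Because the arithmetic is exact, the identity total = propagated + generated holds without any tolerance, and the bound on magnitudes follows from the triangle inequality.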

That propagated and generated errors depend upon the details of the 
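routine, such as the value of $\lambda$ for any representation, and even the order
in which certain operations are carried out, is easily seen. Thus, to consider
only round-off, if $\phi$ is some operation, it may be that mathematically

$(x^* \omega y^*)\phi z^* = x^* \omega (y^* \phi z^*)$.

That is, the two operations may be associative mathematically. Nevertheless,

$(x^* \omega y^*)\phi z^* - (x^* \omega^* y^*)\phi^* z^* = [(x^* \omega y^*)\phi z^* - (x^* \omega^* y^*)\phi z^*] + [(x^* \omega^* y^*)\phi z^* - (x^* \omega^* y^*)\phi^* z^*]$,

whereas

$x^* \omega (y^* \phi z^*) - x^* \omega^* (y^* \phi^* z^*) = [x^* \omega (y^* \phi z^*) - x^* \omega (y^* \phi^* z^*)] + [x^* \omega (y^* \phi^* z^*) - x^* \omega^* (y^* \phi^* z^*)]$.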

Thus the errors generated by performing the operations in the two pos- 
sible ways have different expressions, and cannot be assumed equal 
without proof. 

1.3. Propagated Error and Significant Figures. Let 

(1.3.1)    $x = x^* + \xi$, $y = y^* + \eta$, . . . ,

and consider the problem of evaluating the function $f(x, y, \ldots)$. If the
function can be expanded in Taylor's series, then

(1.3.2)    $f(x, y, \ldots) - f(x^*, y^*, \ldots) = \xi f_x + \eta f_y + \cdots$,

where the partial derivatives are to be evaluated at $x^*, y^*, \ldots$. This
represents the error in $f$ arising from errors in the arguments, i.e., the
propagated error. Generally one expects the errors $\xi, \eta, \ldots$ to be
"small" so that the terms of second and higher power can be neglected.
If so, then the propagated error $\Delta f$ satisfies, approximately,

(1.3.3)    $|\Delta f| \le |\xi f_x| + |\eta f_y| + \cdots$.

This is strictly true when $f$ is a simple sum:

$f = \pm x \pm y \pm \cdots$.

Hence the error in a sum does not exceed the sum of the errors in the 
terms. 

One can, by direct differentiation, write down any number of special 
relations (1.3.3) based upon the assumption that the errors in the argu- 
ments are small. Nevertheless, for the detailed analysis one must go to 
the individual elementary operations. 

Consider the case of the product and quotient. For the first we have 

$xy - x^* y^* = x^*\eta + y^*\xi + \xi\eta$.

It is sometimes convenient to consider the relative error, which is the ratio
of the error to the magnitude. Hence

$\frac{xy - x^* y^*}{x^* y^*} = \frac{\xi}{x^*} + \frac{\eta}{y^*} + \frac{\xi}{x^*}\,\frac{\eta}{y^*}$.

Usually one says that the relative error in a product is the sum of the
relative errors in the factors, and this is approximately true if these rela- 
tive errors are both small, but only then. 
For the quotient 



$\frac{x}{y} - \frac{x^*}{y^*} = \frac{y^*\xi - x^*\eta}{y^*(y^* + \eta)}$

and the relative error is

(1.3.5)    $\frac{x/y - x^*/y^*}{x^*/y^*} = \frac{\xi/x^* - \eta/y^*}{1 + \eta/y^*}$.

If the relative error $\eta/y^*$ is negligible, then the relative error in the
quotient does not exceed in magnitude the sum of the magnitudes of the
relative errors in the terms. Nevertheless, if $\eta/y^* < 0$ and not numerically
small, the conclusion does not follow.
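
The two relative-error formulas can be confirmed on sample numbers with exact fractions. The sketch below is illustrative only; the function name `relative_errors` and the sample values are assumptions, with the errors written as xi and eta as in (1.3.1).

```python
from fractions import Fraction

def relative_errors(x_star, y_star, xi, eta):
    """Return relative errors of x, y, the product xy and the quotient x/y."""
    x, y = x_star + xi, y_star + eta            # true values, x = x* + xi, y = y* + eta
    rx, ry = xi / x_star, eta / y_star
    product = (x * y - x_star * y_star) / (x_star * y_star)
    quotient = (x / y - x_star / y_star) / (x_star / y_star)
    return rx, ry, product, quotient

rx, ry, product, quotient = relative_errors(
    Fraction(33, 100), Fraction(67, 100), Fraction(1, 300), Fraction(-1, 300)
)
print(float(rx), float(ry), float(product), float(quotient))
```

The product's relative error differs from rx + ry only by the second-order term rx*ry, while the quotient's carries the denominator 1 + ry, which is what spoils the simple rule when ry is negative and not small.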

If $x^*$, given by (1.2.1), represents a number whose true value is $x$,
and if

(1.3.6)    $|x^* - x| \le \beta^{\sigma-\lambda}/2$,

the digits $x_1, \ldots, x_\lambda$ may be said to be reliable or significant. If $x_1 \ne 0$,
then $x^*$ is said to contain $\lambda$ significant figures. However, if the inequality
(1.3.6) is not known to hold, the last digit, $x_\lambda$, is "in doubt." When a
number such as x* is written in its usual form as a sequence of digits with 
a decimal point, it is usually understood that all digits are significant, 
from the first non-null digit to the last digit written, unless limits of error 
are specified. Some textbooks give various rules of thumb for deter- 
mining the number of significant digits in computed quantities, and the 
implication is sometimes left that nonsignificant digits should be dropped 
before making subsequent computations with these numbers, or at least 
that nothing is to be gained by retaining the doubtful digits. But this is 
clearly not the case, since dropping these digits generally enlarges the 
region of uncertainty. 

1.4. Generated Error. In many of the newer automatic computing 
machines, the "built-in" arithmetic operations are designed to yield 
correct results only when the operands as well as the results are numbers 
of a form (1.2.1), where $\lambda$ is fixed, and $\sigma = 0$. Such a number von Neumann
and Goldstine call a "digital number." Any digital number is
therefore less than unity. 

Given two digital numbers $a^*$ and $b^*$, the machine will correctly form
the sum only if

$|a^* + b^*| < 1$,

and will correctly form the difference if

$|a^* - b^*| < 1$.

But if the one condition or the other holds, the machine will correctly
form the sum or difference, as the case may be, and no round-off is
generated. Hence if $a^* + b^*$ or $a^* - b^*$ is digital, $a^* + b^*$ or $a^* - b^*$
can be formed, and formed correctly, without generating any new error.
If $a^*$ and $b^*$ are digital, then necessarily

$|a^* b^*| < 1$.

However, the true product $a^* b^*$ is a number of $2\lambda$ places. If the machine
holds only $\lambda$ places, it will not form the true product $a^* b^*$, but a pseudo
product. It can be represented $a^* \times b^*$. It may be that the machine
merely drops off the last $\lambda$ places from $a^* b^*$. Or it may be that the
machine first forms $a^* b^* + \beta^{-\lambda}/2$ and then drops the last $\lambda$ places. In
the latter event the pseudo product satisfies

(1.4.1)    $|a^* b^* - a^* \times b^*| \le \epsilon = \beta^{-\lambda}/2$,

where $\epsilon$ is introduced to simplify notation. Let us assume that (1.4.1)
holds.

For division, the quotient a*/b* will usually require infinitely many 
places for its true representation, even though a* and b* are both digital. 
The machine, however, can retain only the first $\lambda$, and it may compute the
first $\lambda$ places of $a^*/b^*$ and drop the rest, or it may compute the first $\lambda$
places of $a^*/b^* + \beta^{-\lambda}/2$. In the latter event, the retained $\lambda$ places
represent the pseudo quotient $a^* \div b^*$, which satisfies

(1.4.2)    $|a^*/b^* - a^* \div b^*| \le \epsilon$.

Given a series of $n$ products $a_i^* b_i^*$ to be added together, we have

(1.4.3)    $|\sum a_i^* \times b_i^* - \sum a_i^* b_i^*| \le n\epsilon$.

However, instead of recording each product as a digital number and adding
the results, it may be possible for the machine to retain and accumulate
the true products of $a_i^*$ and $b_i^*$, rounding off the sum as a digital
number. This pseudo operation may be designated $\sum^* a_i^* b_i^*$, and for this
we have

(1.4.4)    $|\sum^* a_i^* b_i^* - \sum a_i^* b_i^*| \le \epsilon$.

While

(1.4.5)    $a^* \times b^* = b^* \times a^*$,

equalities in terms of arithmetic operations do not always hold strictly
when these are replaced by pseudo operations, as was already shown in
general. In particular,

$|(a^* + b^*) \times c^* - (a^* \times c^* + b^* \times c^*)| \le 3\epsilon$,

since each pseudo multiplication could give rise to an error $\epsilon$. However,
the two quantities being compared can differ only by a digit in the last
place, which is to say by an integral multiple of $2\epsilon$. Consequently we
can improve this by saying that

(1.4.6)    $|(a^* + b^*) \times c^* - (a^* \times c^* + b^* \times c^*)| \le 2\epsilon$.
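
For a very short word length the bound (1.4.6) can be checked exhaustively. The following Python sketch models non-negative binary digital numbers as exact fractions; the word length `LAMBDA = 4`, the restriction to non-negative operands, and the names `pseudo_mul` and `worst_distributive_gap` are assumptions chosen to keep the example small.

```python
from fractions import Fraction

LAMBDA = 4                        # assumed word length, tiny so every case can be tried
UNIT = Fraction(1, 2 ** LAMBDA)   # one unit in the last place, equal to 2*eps
EPS = UNIT / 2

def pseudo_mul(a: Fraction, b: Fraction) -> Fraction:
    """Form a*b + eps, then drop everything beyond LAMBDA binary places."""
    return ((a * b + EPS) // UNIT) * UNIT

def worst_distributive_gap() -> Fraction:
    """Largest |(a+b) x c - (a x c + b x c)| over non-negative digital a, b, c."""
    digital = [k * UNIT for k in range(2 ** LAMBDA)]
    worst = Fraction(0)
    for a in digital:
        for b in digital:
            if a + b >= 1:        # the sum must itself be digital
                continue
            for c in digital:
                gap = abs(pseudo_mul(a + b, c) - (pseudo_mul(a, c) + pseudo_mul(b, c)))
                worst = max(worst, gap)
    return worst

print(worst_distributive_gap(), "<=", 2 * EPS)
```

Every gap found is a whole number of units in the last place, which is the observation that sharpens the bound from 3ε to 2ε.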

In order to examine the effect of grouping in a continued product, we 
note that 

$a^* \times (b^* \times c^*) - a^* b^* c^* = [a^* \times (b^* \times c^*) - a^*(b^* \times c^*)] + [a^*(b^* \times c^*) - a^* b^* c^*]$

so that

$|a^* \times (b^* \times c^*) - a^* b^* c^*| \le (1 + |a^*|)\epsilon < 2\epsilon$.

If we now interchange $a$ and $c$ and add results, we have

$|a^* \times (b^* \times c^*) - (a^* \times b^*) \times c^*| \le (2 + |a^*| + |c^*|)\epsilon$.

But if $|a^*| = |c^*| = 1$, the left-hand side is zero, and otherwise the left-hand
side is less than $4\epsilon$. But it must be an integral multiple of $2\epsilon$, so that

(1.4.7)    $|a^* \times (b^* \times c^*) - (a^* \times b^*) \times c^*| \le 2\epsilon$.

The two pseudo products either agree or differ by one in the last place. 
Finally consider 

$(a^* \div b^*) \times b^* - a^* = [(a^* \div b^*) \times b^* - (a^* \div b^*)b^*] + (a^* \div b^* - a^*/b^*)\,b^*$.

Since this is less than $(1 + |b^*|)\epsilon$, it is actually less than $2\epsilon$, and since it is
an integral multiple of $2\epsilon$, it vanishes. Hence

(1.4.8)    $(a^* \div b^*) \times b^* = a^*$.

However

$(a^* \times b^*) \div b^* - a^* = [(a^* \times b^*) \div b^* - (a^* \times b^*)/b^*] + [(a^* \times b^*) - a^* b^*]/b^*$,

so that

(1.4.9)    $|(a^* \times b^*) \div b^* - a^*| \le (1 + |b^*|^{-1})\epsilon$.

If $|b^*|$ is small, $|b^*|^{-1}$ will be large, and the error can be large. However,
bear in mind that this comparison is made for pseudo operations in which 
the rounded-off product is used. If the machine retains the complete 
product to use as the dividend, the error will not arise. 
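
The contrast between (1.4.8) and (1.4.9) shows up clearly on a toy machine. The sketch below again models non-negative binary digital numbers as exact fractions with rounding by adding ε and dropping the excess places; the word length and the names `pseudo_mul`, `pseudo_div`, `divide_then_multiply` and `multiply_then_divide` are assumptions made for illustration.

```python
from fractions import Fraction

LAMBDA = 4                        # assumed word length for the toy machine
UNIT = Fraction(1, 2 ** LAMBDA)   # 2*eps
EPS = UNIT / 2

def pseudo_mul(a, b):
    """Rounded product: add eps, then drop places beyond LAMBDA."""
    return ((a * b + EPS) // UNIT) * UNIT

def pseudo_div(a, b):
    """Rounded quotient: add eps, then drop places beyond LAMBDA."""
    return ((a / b + EPS) // UNIT) * UNIT

def divide_then_multiply(a, b):
    return pseudo_mul(pseudo_div(a, b), b)      # left member of (1.4.8)

def multiply_then_divide(a, b):
    return pseudo_div(pseudo_mul(a, b), b)      # left member of (1.4.9)

a, b = Fraction(7, 16), Fraction(1, 16)
print(multiply_then_divide(a, b))   # the rounded product a x b is 0, so a is lost entirely
print(divide_then_multiply(Fraction(1, 16), Fraction(7, 16)))
```

With b equal to one unit in the last place, multiplying first destroys a completely, yet the result still respects the bound (1 + 1/|b|)ε, which is then large. Dividing first and multiplying back returns a whenever the quotient is digital.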
More generally, and in the same way, we have 

(1.4.10)    $|(a^* \div b^*) \times c^* - (a^*/b^*)c^*| \le (1 + |c^*|)\epsilon$,

while

(1.4.11)    $|(a^* \times b^*) \div c^* - a^* b^*/c^*| \le (1 + |c^*|^{-1})\epsilon$.

Interchange of $a^*$ and $c^*$ in (1.4.10) gives

$|(c^* \div b^*) \times a^* - (c^*/b^*)a^*| \le (1 + |a^*|)\epsilon$,

so that by addition

$|(a^* \div b^*) \times c^* - (c^* \div b^*) \times a^*| \le (2 + |a^*| + |c^*|)\epsilon$.

But the left member vanishes when $|a^*| = |c^*| = 1$, and otherwise the
right member is less than $4\epsilon$. Hence in any event

(1.4.12)    $|(a^* \div b^*) \times c^* - (c^* \div b^*) \times a^*| \le 2\epsilon$.

1.5. Complete Error Analyses. It was shown above that the magnitude
of the error in the result of any subroutine cannot exceed the sum
of the magnitudes of the propagated, generated, and residual errors.
In particular instances a routine can be devised such that the error due
to one of two sources necessarily is of one sign and that due to the other
is of the opposite sign. In such a situation the resultant error cannot 
exceed the larger of the two individual errors. But to devise a routine 
having this property, or even to discover that a given routine has it, may 
take a disproportionate amount of time and not be worth the effort. 
Generally one must assume that the errors can build up, and attempt to 
devise a routine that will keep all errors as small as possible. And in any 
event one must somehow balance the time to be spent on an analysis 
against the time required for the computation and the allowed tolerances. 

Some computations are of such frequent occurrence that it may be 
worth while devoting considerable time and effort to the design of an 
optimal routine and a precise error analysis for that routine. We give an 
example or two. 

Consider a binary computing machine with $\lambda$ magnitude digits and one
sign digit, representing only numbers of magnitude less than unity. This
machine will compute a $(2\lambda)$-digit product and accept a $(2\lambda)$-digit dividend,
but the digital pseudo products and pseudo quotients will be
supposed to satisfy, in general,

$0 \le ab - a \times b < 2^{-\lambda}$,
$0 \le a/b - a \div b < 2^{-\lambda}$,

the latter relation presupposing the division to be possible. 

We require an optimal routine and precise error analysis for $\sqrt{a}$, using
Newton's method. This means that the number $a$ whose square root is
required is a digital number, $0 \le a < 1$, and the routine is to yield a
digital number $x$ for which the maximal $|x - \sqrt{a}|$ is to be as small as we
can make it. Indeed if the routine is properly constructed, then there
should be some half-open interval of length $2^{-\lambda}$ which contains both
$\sqrt{a}$ and $x$.

By Newton's method one takes some positive $x'_0 \le 1$ and forms

(1.5.1)    $x'_{i+1} = x'_i - (x'_i - a/x'_i)/2$,

and the sequence can be shown to approach $\sqrt{a}$ as a limit. Moreover, if
one takes $x'_0 = 1$, then surely

(1.5.2)    $x'_i > \sqrt{a}$

when $i = 0$, and one can show inductively that this relation holds for
every $i$. In fact, one verifies directly that

$x'_{i+1} - \sqrt{a} = (x'_i - \sqrt{a})^2 / (2x'_i)$.

Nevertheless, the sequence one actually forms in the machine is not 
strictly defined by (1.5.1), but instead by 

(1.5.3)    $x_{i+1} = x_i - (x_i - a \div x_i) \div 2$.

(All numbers will be digital numbers, and the asterisk can be omitted.)
There is no a priori assurance that the numbers $x_i$ will satisfy (1.5.2), and
this point must be investigated. 

We first show that, if $y$ and $z$ are any two digital numbers satisfying

(1.5.4)    $(z - a \div z) \div 2 > (y - a \div y) \div 2 > 0$,

then

(1.5.5)    $z > y > \sqrt{a}$.

The second inequality in (1.5.4) is equivalent to

$(y - a \div y) \div 2 \ge 2^{-\lambda}$,

so that

$y \ge 2^{-\lambda+1} + a \div y > 2^{-\lambda} + a/y > a/y$,

whence $y^2 > a$, which proves the second inequality in (1.5.5). By the
first inequality in (1.5.4),

$(z - a \div z) \div 2 \ge (y - a \div y) \div 2 + 2^{-\lambda}$.

But

$(z - a \div z)/2 \ge (z - a \div z) \div 2$,
$(y - a \div y) \div 2 \ge (y - a \div y)/2 - 2^{-\lambda-1}$,

since in forming the pseudo quotient by 2 (which is a single shift to the
right) the error is either 0 or $2^{-\lambda-1}$. Hence

$(z - a \div z)/2 \ge (y - a \div y)/2 + 2^{-\lambda-1}$,
$z - a \div z \ge y - a \div y + 2^{-\lambda}$.

Again,

$a \div z - a/z > -2^{-\lambda}$,
$a \div y - a/y \le 0$,

whence

$z - a/z > y - a/y$.

But the function $f(z) = z - a/z$ is properly monotonically increasing.
Hence $f(z) > f(y)$ implies the first inequality in (1.5.5).
This implies, in particular, that, if

$(x_i - a \div x_i) \div 2 = x_i - x_{i+1} > 0$,

then $x_i > \sqrt{a}$. If it should happen that

(1.5.6)    $(x_i - a \div x_i) \div 2 \le 0$,

then clearly we should take $x_i$ and not $x_{i+1}$ as $x$. On the other hand, if

(1.5.7)    $(x_{i-1} - a \div x_{i-1}) \div 2 \ge 2^{-\lambda}$,

we shall take at least one more step in the iteration.
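Suppose the equality holds in (1.5.7). Then

$x_i = x_{i-1} - 2^{-\lambda}$,
$(x_{i-1} - a \div x_{i-1})/2 \ge (x_{i-1} - a \div x_{i-1}) \div 2 = 2^{-\lambda}$,

whence

$x_{i-1} - a \div x_{i-1} \ge 2^{-\lambda+1}$.

Hence

$x_{i-1} \ge a \div x_{i-1} + 2^{-\lambda+1}$,
$x_i \ge a \div x_{i-1} + 2^{-\lambda} > a/x_{i-1}$,
$x_i x_{i-1} > a$,
$x_i + 2^{-\lambda} > a/x_i$,
$x_i - a/x_i > -2^{-\lambda}$.

This holds a fortiori if the inequality holds in (1.5.7). Hence in all cases

$x - a/x > -2^{-\lambda}$,
$(x + 2^{-\lambda-1})^2 > a + 2^{-2\lambda-2}$,

(1.5.8)    $x > (a + 2^{-2\lambda-2})^{1/2} - 2^{-\lambda-1}$.

This gives a lower bound for the computed value.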
Next suppose

$(x_i - a \div x_i) \div 2 \le 0$.

Then

$x_i - a \div x_i \le 2^{-\lambda}$,
$x_i \le a \div x_i + 2^{-\lambda} \le a/x_i + 2^{-\lambda}$.

Hence in all cases

$x \le a/x + 2^{-\lambda}$,
$(x - 2^{-\lambda-1})^2 \le a + 2^{-2\lambda-2}$,

(1.5.9)    $x \le (a + 2^{-2\lambda-2})^{1/2} + 2^{-\lambda-1}$.

Inequalities (1.5.8) and (1.5.9) define a half-open interval of length $2^{-\lambda}$
and center $(a + 2^{-2\lambda-2})^{1/2}$ upon which $x$ must lie. These inequalities can
be written

(1.5.10)    $(x^2 - 2^{-\lambda}x)^{1/2} \le \sqrt{a} < (x^2 + 2^{-\lambda}x)^{1/2}$.

In the worst case, when $a = 0$, $x = 2^{-\lambda}$, and the error is $2^{-\lambda}$. In all other
cases the error is less. The case $a = 0$, $x = 2^{-\lambda}$ could arise in machine
computation when $a$ is an intermediate result, obtained from previous
computation.
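
The square-root routine analyzed above can be imitated with integer arithmetic and its final value compared against the bounds in the form $x^2 - 2^{-\lambda}x \le a < x^2 + 2^{-\lambda}x$. The sketch below is an illustration under stated assumptions, not the machine routine itself: all quantities are integers counting units of $2^{-\lambda}$, the pseudo quotient is formed by chopping, and the starting value is taken as $x_0 = 1$, which the integer model can hold even though it is not strictly a digital number. The names `machine_sqrt` and `satisfies_bounds` are introduced here.

```python
def machine_sqrt(a_units: int, lam: int) -> int:
    """Return x (in units of 2**-lam) from the iteration x <- x - (x - a ÷ x) ÷ 2."""
    scale = 1 << lam
    x = scale                                  # assumed start x0 = 1
    while True:
        quotient = (a_units * scale) // x      # a ÷ x, chopped to lam places
        step = (x - quotient) >> 1             # division by 2 as a right shift
        if step <= 0:                          # stopping condition of the text
            return x
        x -= step                              # otherwise take one more step

def satisfies_bounds(a_units: int, lam: int) -> bool:
    """Check x**2 - 2**-lam * x <= a < x**2 + 2**-lam * x in integer form."""
    x = machine_sqrt(a_units, lam)
    scaled_a = a_units << lam
    return x * x - x <= scaled_a < x * x + x

lam = 8
print(all(satisfies_bounds(a, lam) for a in range(1 << lam)))
print(machine_sqrt(0, lam))   # the worst case a = 0
```

For a = 0 the routine halts at one unit in the last place, which is the worst case noted in the text.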

If $a$ itself can be in error by an amount $\alpha$, then by 1.3 the propagated
error is approximately $\alpha/(2\sqrt{a})$, when $a \ne 0$. Hence the total error
cannot exceed $2^{-\lambda} + \alpha/(2\sqrt{a})$, when $a \ne 0$. If $\alpha$ represents the maximum
possible difference between $a + 2^{-2\lambda-2}$ and the true value for $a$, then
the maximum error can be written as $2^{-\lambda-1} + \alpha/(2x)$.

This case is especially favorable because of the fact that round-off 
errors do not build up in the course of the computation. At each step 
the best available approximation is used as the basis for obtaining a 
better one, and one continues to get improvement until a stage is reached 
at which the error inherent in a single step is as great as the truncation 
error. 

By way of comparison, consider a computation based upon Taylor's 
series, where the round-off can accumulate. We require the evaluation 
of both sin x and cos x, for $|x| \le \pi/4 < 1$, to be followed by a check
based upon the identity $\sin^2 x + \cos^2 x = 1$. Clearly the computed values
of sin x and cos x will not necessarily satisfy the identity strictly. How 
close can we get to the true values of sin x and cos x, and how closely can 
we expect our approximations to satisfy the identity? 

We shall, in fact, describe a routine for computing s, an approximation 
to sin x, and c, an approximation to 1 cos x. Let Wi = x, a digital 
number, and 

w n = (x/n)w n -ij 

wl = (x + n) X w*_ r 

For these operations it is assumed that 



|a*6* - a* X fc*| < 2 



40 



The terms w n are the terms which appear in the expansions, while w* are 
the terms we actually obtain in the computation. For this machine 
X = 39, and the machine accepts a (2X) -digit dividend so that divisions 
x -s- n are performed by dividing (2~ x #) by (2~ x ). Let 



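$2^{-40}\epsilon_n = |w_n^* - w_n|.$

Then

$w_n^* - w_n = [(x \div n) \times w_{n-1}^* - (x \div n)\,w_{n-1}^*] + [(x \div n) - (x/n)]\,w_{n-1}^* + (x/n)(w_{n-1}^* - w_{n-1}).$
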
The division steps satisfy

$0 \le x/n - x \div n < 2^{-39}(n - 1)/n.$

Also

$|w_n^*| < 1/n!.$

Hence

$\epsilon_n < 1 + 2(n - 1)/n! + \epsilon_{n-1}/n.$

The residual error is less than the first neglected term, and for $n > 15$,
$|w_n| < 2^{-40} = 2^{-\lambda-1}$. Hence on solving recursively and adding the
generated errors (the $\epsilon$'s) and the residual errors, we have

$|c - (1 - \cos x)| < 1.197 \times 2^{-37},$

$|s - \sin x| < 1.140 \times 2^{-37}.$

For the check let

$\cos x = 1 - c + \epsilon, \qquad \sin x = s + \epsilon',$

where $\epsilon$ and $\epsilon'$ are bounded by the right members of the above inequalities.
Then

$2c - c^2 - s^2 = 2\epsilon \cos x + 2\epsilon' \sin x - \epsilon^2 - \epsilon'^2.$

Hence

$|2c - c^2 - s^2| < 2\epsilon'(\cos x + \sin |x|) + 2(\epsilon - \epsilon') \cos x$

$\qquad < 2\epsilon'\sqrt{2} + 2(\epsilon - \epsilon')$

$\qquad < 1.669 \times 2^{-36}.$

Hence

$|2c - c \times c - s \times s| \le |2c - c^2 - s^2| + |c \times c - c^2| + |s \times s - s^2|$

$\qquad < 2^{-35}.$

Thus in applying the check, we compute a quantity which should vanish 
if there were no errors due to truncation or round-off, but on the basis of 
this analysis we can say only that the computed value must be less than
$2^{-35}$. If a larger value occurs, it must be due to blunders.

This analysis shows only that the quantity computed for the check 
cannot be so great as $2^{-35}$ in magnitude. It does not show that quantities
as great as $15 \cdot 2^{-40}$ could, in fact, occur, nor even quantities as great as
$2^{-36}$. Hence a more detailed analysis, which pays attention to the
possible signs of the errors at each step, might yield a somewhat smaller
bound. 
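
The routine analyzed above can be imitated with integer arithmetic. The following is only a sketch under stated assumptions, not the routine of any particular machine: numbers are held as integers scaled by $2^{39}$, the pseudo-division $x \div n$ is modeled by floor division, the pseudo-multiplication $\times$ by rounding to the nearest unit, and the series is stopped at $n = 15$. The function names are illustrative only.

```python
import math

LAM = 39              # number of fractional bits (the text's lambda)
ONE = 1 << LAM        # fixed-point representation of 1

def mul(a, b):
    """Rounded fixed-point product; the rounding error is at most 2**-40."""
    return (a * b + (ONE >> 1)) >> LAM

def sin_and_one_minus_cos(x_fixed, n_max=15):
    """Return (s, c), fixed-point approximations to sin x and 1 - cos x."""
    w = x_fixed                            # w*_1 = x
    s, c = w, 0
    for n in range(2, n_max + 1):
        w = mul(x_fixed // n, w)           # w*_n = (x div n) x w*_{n-1}
        sign = -1 if (n // 2) % 2 else 1   # signs of the cosine/sine series terms
        if n % 2:
            s += sign * w                  # odd terms build sin x
        else:
            c -= sign * w                  # even terms build 1 - cos x
    return s, c

def identity_check(s, c):
    """The quantity 2c - c x c - s x s, which would vanish without errors."""
    return 2 * c - mul(c, c) - mul(s, s)

if __name__ == "__main__":
    x_fixed = round(0.5 * ONE)
    s, c = sin_and_one_minus_cos(x_fixed)
    print(abs(s / ONE - math.sin(x_fixed / ONE)), abs(identity_check(s, c)) / ONE)
```

Comparing the results with double-precision `math.sin` and `math.cos` is legitimate here only because their errors are far smaller than $2^{-37}$.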

1.6. Statistical Estimates of Error. The discussion in §1.3 shows what
is intuitively obvious to begin with: that given $n$ numbers, each of which
can be in error by as much as $\epsilon$, their sum can be in error by as much as
$n\epsilon$. However, the occurrence of this greatest possible error will be
extremely infrequent in practice. Even assuming that the error in each
term is maximal, which is improbable in itself, the probability is only $2^{-n}$
that the maximal error of $n\epsilon$ would occur in the sum, since that would
require that the individual errors all be of the same sign. If one can
assert of each term $x$ in the sum that its error can have any value between
$-\epsilon$ and $\epsilon$, say with uniform probability, then the probability of occurrence
of the maximal error $n\epsilon$ becomes much smaller than $2^{-n}$.
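
A small simulation makes the point concrete. It is a sketch only, assuming independent errors distributed uniformly between $-\epsilon$ and $\epsilon$; the function name, the number of trials, and the seed are arbitrary choices and not part of the text.

```python
import random

def sum_error_stats(n=10, eps=1.0, trials=20000, seed=0):
    """Simulate sums of n independent errors, each uniform on [-eps, eps].

    Returns the largest |sum| observed and the fraction of trials in which
    |sum| exceeded half of the strict bound n * eps.
    """
    rng = random.Random(seed)
    worst = 0.0
    exceed_half = 0
    for _ in range(trials):
        total = sum(rng.uniform(-eps, eps) for _ in range(n))
        worst = max(worst, abs(total))
        if abs(total) > 0.5 * n * eps:
            exceed_half += 1
    return worst, exceed_half / trials

if __name__ == "__main__":
    worst, frac = sum_error_stats()
    print("largest |sum| seen:", worst, "strict bound:", 10.0)
    print("fraction above half the strict bound:", frac)
```

Under these assumptions even half of the strict bound is exceeded only rarely, which is the motivation for a probabilistic rather than a strict estimate.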

As with sums, so with any other computation or sequence of computa- 
tions, the probability of occurrence of a maximal error may be extremely 
small. It is reasonable to inquire, therefore, as to the probability that 
the accumulated error in a given computation will exceed some assigned 
limit. A probabilistic approach is the more clearly indicated if one con- 
siders the fact that limits of error in measured quantities can seldom be 
assigned with certainty. At best one can assign a probability to the 
assertion that the error of measurement does not exceed some given 
amount. 

In principle a statistical estimate of errors can be made by going 
through the same steps as in a strict estimate, except that at each step one 
requires a distribution of errors in the data and seeks a distribution, 
rather than strict limits, for the errors in the result. Unfortunately, 
besides the fact that the computation of these distributions is intrinsically 
difficult, questions of statistical independence are especially troublesome. 
Consequently we mention this approach only in passing and point out 
that there is a growing literature on the subject, a few titles of which are 
listed among the references. 

1.7. Bibliographic Notes. The subject of errors is given at least 
casual discussion in most standard texts, either as a separate topic or in 
connection with particular computations. Most papers in the periodicals 
deal with errors in particular computations. The four sources of error 
are distinguished, digital numbers defined, and the error formulas of §1.4
are given in von Neumann and Goldstine (1947), where the major topic 
is errors in matrix computations. Turing (1948) also discusses matrix 
operations in some detail. Rademacher (1947) and Plarrison (1951) 
discuss errors in the numerical solution of differential equations, but these 
papers are also of more general interest. Dwyer (1951) discusses errors 
at length. 

On statistical assessments Inman (1950) discusses the problem in 
general, Rademacher (1947) makes applications to differential equations, 
while Huskey (1949) finds a failure which is explained by Hartree. 
Goldstine and von Neumann (1951) give statistical estimates in matrix 
computations. 

Papers yet to be published by Goldstine, Murray and von Neumann, 
and J. W. Givens, each concerned with the problem of finding proper 
values of matrices, will contain elaborate error analyses. It seems safe 
to predict that an increasing number of detailed analyses of specific 
routines, such as the ones given here for the square root and the circular 
functions, will be issued for limited distribution by groups operating
high-speed computing machines.

On the operation of automatic digital computers see Berkeley (1949), 
Wilkes, Wheeler and Gill (1951), Engineering Research Associates (1950),
and the (Harvard) Computation Laboratory (1946, 1949). In addition 
to these references, which are listed in the bibliography, there are numer- 
ous reports and memoranda issued by organizations which build or 
operate particular machines: IBM Corporation, MIT, Harvard Com- 
putation Laboratory, University of Illinois, NBS, Institute for Advanced 
Study, and a number of others. A section of the periodical MTAC is 
devoted to electronic computers. And finally, abstracts or complete 
papers presented at meetings of the Association for Computing Machinery 
are obtainable. 



CHAPTER 2 
MATRICES AND LINEAR EQUATIONS 



2. Matrices and Linear Equations 

The numerical solution of an integral equation, of a partial differential 
equation, or of an ordinary differential equation with two-point boundary 
conditions is generally obtained by solving an approximating linear 
algebraic system. Moreover in order to solve a nonlinear problem, one 
may replace it by a sequence of linear systems providing progressively 
improved approximations. For these reasons, and because of the theo- 
retical simplicity, we start with linear systems of equations. For study- 
ing linear systems of equations a geometric terminology, with the compact 
symbolism of vectors and matrices, is extremely useful. A résumé of the
basic principles is therefore included. 

2.01. Vectors and Coordinate Systems. Any $n$ vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ are
said to be linearly dependent in case any of them can be expressed as a
linear combination of the others. A more symmetric statement of the
same property is that there are scalars $\alpha_1, \ldots, \alpha_n$, not all zero, satisfying

$\alpha_1 \mathbf{e}_1 + \cdots + \alpha_n \mathbf{e}_n = 0.$

The equivalence of the two statements is made clear by considering that,
if $\alpha_n \neq 0$, then we could solve for $\mathbf{e}_n$ in terms of the other vectors. As an
example, two vectors are linearly dependent in case they are parallel, or
if one is the null vector.

A vector space is $n$-dimensional in case there are $n$ linearly independent
vectors in the space, but any $n + 1$ vectors are linearly dependent. Let
$\mathbf{e}_1, \ldots, \mathbf{e}_n$ be linearly independent, and let $\mathbf{x}$ be any vector in the space.
These $n + 1$ vectors are linearly dependent. Hence we can find scalars
$\beta_1, \ldots, \beta_n, \beta$, not all zero, such that

$\beta_1 \mathbf{e}_1 + \cdots + \beta_n \mathbf{e}_n + \beta \mathbf{x} = 0.$

But certainly, then, $\beta \neq 0$, since if we had $\beta = 0$, the relation would
express the linear dependence of the set of $\mathbf{e}$'s, whereas they are taken to
be linearly independent. We can therefore solve for $\mathbf{x}$ and write

(2.01.1)  $\mathbf{x} = \xi_1 \mathbf{e}_1 + \cdots + \xi_n \mathbf{e}_n.$

is called a matrix, and the column (2.01.7) a (numerical) vector, and we
designate these $E$ and $x$, respectively. The rule, then, is the rule for
multiplying a matrix by a vector,

(2.01.8)  $x' = Ex,$

where $x'$ is the column of the $\xi'_j$.

Besides the vectors $\mathbf{e}_i$ and $\mathbf{f}_i$, let $\mathbf{g}_1, \ldots, \mathbf{g}_n$ represent also a set of
linearly independent vectors in the same space, and let

$\mathbf{f}_j = \sum_k \mathbf{g}_k \phi_{kj}.$

Then

$\mathbf{e}_i = \sum_j \mathbf{f}_j \epsilon_{ji} = \sum_j \sum_k \mathbf{g}_k \phi_{kj} \epsilon_{ji}.$

Hence if $F$ represents the matrix of the $\phi_{kj}$,

$F = (\phi_{kj}),$

and if $P$ represents the matrix

(2.01.9)  $P = (\pi_{ki}) = \Bigl(\sum_j \phi_{kj}\epsilon_{ji}\Bigr),$

we shall say that $P$ is the product of the matrices $F$ and $E$:

(2.01.10)  $P = FE,$

with (2.01.9) giving the rule for multiplication. This is consistent with
the rule for multiplying a matrix by a vector. Note that the product
$EF$ is, in general, not the same as the product $FE$.
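
The multiplication rule can be written out directly. The sketch below uses plain nested lists; the helper names `matmul` and `matvec` and the sample matrices are illustrative choices, not notation from the text.

```python
def matmul(F, E):
    """Product P = FE: element (k, i) is the sum over j of F[k][j] * E[j][i]."""
    return [[sum(F[k][j] * E[j][i] for j in range(len(E)))
             for i in range(len(E[0]))]
            for k in range(len(F))]

def matvec(E, x):
    """Matrix times column vector, the rule x' = Ex."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in E]

E = [[1, 2], [0, 1]]
F = [[1, 0], [3, 1]]
x = [5, 7]

FE = matmul(F, E)
EF = matmul(E, F)

if __name__ == "__main__":
    print("FE =", FE)
    print("EF =", EF)
    print("F(Ex) =", matvec(F, matvec(E, x)), " (FE)x =", matvec(FE, x))
```

Applying $E$ and then $F$ to a vector gives the same result as applying the single matrix $FE$, while $EF$ is a different matrix for this choice of $E$ and $F$.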
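In the particular case when

$\mathbf{g}_i = \mathbf{e}_i,$

then

$\mathbf{e}_i = \sum_k \mathbf{e}_k \pi_{ki},$

and it must then be true that

$\pi_{ki} = \sum_j \phi_{kj}\epsilon_{ji} = \delta_{ki},$

where $\delta_{ki}$ is the Kronecker delta, defined by

(2.01.11)  $\delta_{ki} = 0$ when $k \neq i$, $\qquad \delta_{ki} = 1$ when $k = i$.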
In this event the matrix $P$ has the simple form

(2.01.12)  $I = \begin{pmatrix} 1 & 0 & 0 & \cdots \\ 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$

and is called the identity matrix $I$. Since, in that case,

$I = FE,$

one says that the matrices $F$ and $E$ are reciprocals:

$F = E^{-1}, \qquad E = F^{-1}.$

This is one case in which the order of multiplication is immaterial:

$I = EE^{-1} = E^{-1}E.$

The matrices introduced so far have been square matrices, but matrices 
may also be rectangular, and in particular the vectors $x$ and $x'$ discussed
above may be regarded as matrices of n rows and one column each. We 
may also have a matrix of one row and n columns. Such a matrix would 
be called a row vector, while the vectors x and x' are column vectors. 

We can extend the notational scheme by writing

(2.01.13)  $\mathbf{e} = (\mathbf{e}_1 \;\; \mathbf{e}_2 \;\; \cdots \;\; \mathbf{e}_n),$

and treating $\mathbf{e}$ formally as though it were a (numerical) row vector,
which enables us to write (2.01.3) in abbreviated form

(2.01.14)  $\mathbf{e} = \mathbf{f}E.$

Then, also,

$\mathbf{f} = \mathbf{g}F,$

whence, by formal substitution,

$\mathbf{e} = \mathbf{g}FE.$

This is consistent with previous results, where we found that

$\mathbf{e} = \mathbf{g}P,$

with

$P = FE.$

2.02. Linear Transformations. A transformation of vectors (as dis- 
tinguished from a change of coordinates) is a natural generalization of the 
notion of a function. A transformation of vectors is a rule which, to 
every vector of the space under consideration, associates a unique vector 
in the space, and this vector is called its transform. The transformation 
is linear if to the sum of two vectors corresponds the sum of the
transforms and to a scalar multiple of a vector corresponds the same scalar
multiple of the transform. In many contexts, especially in discussions of
quantum mechanics, it is customary to speak of the transformation 
as being an operation performed by an operator upon the vector and 
yielding the transformed vector (the result of the transformation). 

An equivalent definition, and the one which will be used here, is the
following: If $T(\mathbf{x})$ designates the transform of $\mathbf{x}$, then the transformation
is linear in case

$T(\mathbf{x}) = \sum_i \xi_i T(\mathbf{e}_i),$

when $\mathbf{x}$ is given by (2.01.1). In accordance with the abbreviated notation
this can be written

(2.02.1)  $T(\mathbf{x}) = T(\mathbf{e})x.$

Since each $T(\mathbf{e}_i)$ is in the space of the $\mathbf{e}_i$, it can be expressed

$T(\mathbf{e}_i) = \sum_j \mathbf{e}_j \tau_{ji},$

or in abbreviated form,

(2.02.2)  $T(\mathbf{e}) = \mathbf{e}T,$

where $T$ is the matrix

(2.02.3)  $T = (\tau_{ij}).$

But from (2.02.1)

(2.02.4)  $T(\mathbf{x}) = \mathbf{e}Tx.$

Hence $Tx$ is the numerical vector representing $T(\mathbf{x})$ in the coordinate
system of the $\mathbf{e}_i$, and the matrix $T$ represents the transformation (more
strictly, the operator) in that same coordinate system.

In a different coordinate system, however, $\mathbf{x}$, $T(\mathbf{x})$, and the
transformation itself are otherwise represented. In fact, if

$\mathbf{x} = \mathbf{e}x,$

and

$\mathbf{e} = \mathbf{f}E,$

then

$\mathbf{x} = \mathbf{f}Ex = \mathbf{f}x',$

so that, as we have already seen, $x' = Ex$ represents $\mathbf{x}$ in the coordinate
system $\mathbf{f}$. Now

$T(\mathbf{x}) = \mathbf{e}Tx = \mathbf{f}ETx,$

and

$x = E^{-1}x'.$

Hence

$T(\mathbf{x}) = \mathbf{f}ETE^{-1}x'.$

Hence

(2.02.5)  $T' = ETE^{-1}$

is the matrix which represents the transformation in the coordinate
system $\mathbf{f}$, since this is the matrix which, when applied to the numerical
vector $x'$ (which represents $\mathbf{x}$ in that system), will yield the numerical
vector representing $T(\mathbf{x})$ in that system.
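
Relation (2.02.5) can be checked numerically. The following sketch uses exact rational arithmetic on a 2-by-2 example; the matrices, the vector, and the helper names (`matmul`, `matvec`, `inv2`) are arbitrary illustrative choices.

```python
from fractions import Fraction

def matmul(A, B):
    """Matrix product of nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    """Matrix times column vector."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in A]

def inv2(M):
    """Inverse of a nonsingular 2-by-2 matrix, in exact arithmetic."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

T = [[2, 1], [0, 3]]      # the transformation in the e system
E = [[1, 1], [1, 2]]      # change of coordinates, e = fE
x = [4, -1]               # a vector in the e system

T_prime = matmul(matmul(E, T), inv2(E))    # T' = E T E^{-1}
x_prime = matvec(E, x)                     # x' = E x

if __name__ == "__main__":
    print("T' x'   =", matvec(T_prime, x_prime))
    print("E (T x) =", matvec(E, matvec(T, x)))
```

Transforming in the $\mathbf{e}$ system and then changing coordinates gives the same numerical vector as changing coordinates first and then applying $T'$.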

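2.03. Determinants and Outer Products. An outer product of two
vectors $\mathbf{a}$ and $\mathbf{b}$ is a new type of geometric entity, defined to have the
following properties:

($1_2$)  $[\mathbf{a}, \mathbf{b}] = -[\mathbf{b}, \mathbf{a}];$

($2_2$)  $[\alpha\mathbf{a}, \mathbf{b}] = \alpha[\mathbf{a}, \mathbf{b}];$

($3_2$)  $[\mathbf{a}, \mathbf{b}] + [\mathbf{a}, \mathbf{c}] = [\mathbf{a}, \mathbf{b} + \mathbf{c}].$

It can be pictured as a two-dimensional vector, whose magnitude, taken
positively or negatively, is the area of the parallelogram determined by
the vectors in the product. It follows immediately from ($1_2$) that

$[\mathbf{a}, \mathbf{a}] = 0.$

Hence

$[\mathbf{a}, \mathbf{b}] = [\mathbf{a}, \mathbf{b}] + \alpha[\mathbf{a}, \mathbf{a}] = [\mathbf{a}, \mathbf{b}] + [\mathbf{a}, \alpha\mathbf{a}] = [\mathbf{a}, \mathbf{b} + \alpha\mathbf{a}].$

Hence for any scalars $\alpha$ and $\beta$,

$[\mathbf{a}, \mathbf{b}] = [\mathbf{a}, \mathbf{b} + \alpha\mathbf{a}] = [\mathbf{a} + \beta\mathbf{b}, \mathbf{b}].$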

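If $\mathbf{e}_1$, $\mathbf{e}_2$ are any linearly independent vectors in the space of $\mathbf{a}$ and $\mathbf{b}$,
then

(2.03.1)  $[\mathbf{a}, \mathbf{b}] = |a \;\; b|\,[\mathbf{e}_1, \mathbf{e}_2],$

where

(2.03.2)  $|a \;\; b| = \alpha_1\beta_2 - \alpha_2\beta_1$

is called the determinant of the numerical vectors $a$ and $b$. The evaluation
is immediate:

$[\mathbf{a}, \mathbf{b}] = [\mathbf{a}, \beta_1\mathbf{e}_1 + \beta_2\mathbf{e}_2] = \beta_1[\mathbf{a}, \mathbf{e}_1] + \beta_2[\mathbf{a}, \mathbf{e}_2]$

$\qquad = \beta_1[\alpha_1\mathbf{e}_1 + \alpha_2\mathbf{e}_2, \mathbf{e}_1] + \beta_2[\alpha_1\mathbf{e}_1 + \alpha_2\mathbf{e}_2, \mathbf{e}_2]$

$\qquad = \beta_1\alpha_2[\mathbf{e}_2, \mathbf{e}_1] + \beta_2\alpha_1[\mathbf{e}_1, \mathbf{e}_2]$

$\qquad = (\alpha_1\beta_2 - \alpha_2\beta_1)[\mathbf{e}_1, \mathbf{e}_2].$

The determinant is a number, and its relation to the outer product is
similar to that of the coordinates to a vector.

It is a simple geometric exercise to show that ($3_2$) holds in the
parallelogram interpretation when $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are all in the same 2-space.
When they are not, the relation serves to specify the rule of composition.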

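For outer products of $n$ vectors the defining relations are sufficiently
typified by the case $n = 3$:

($1_3$)  $[\mathbf{a}, \mathbf{b}, \mathbf{c}] = -[\mathbf{a}, \mathbf{c}, \mathbf{b}] = -[\mathbf{c}, \mathbf{b}, \mathbf{a}] = \cdots;$

($2_3$)  $[\alpha\mathbf{a}, \mathbf{b}, \mathbf{c}] = \alpha[\mathbf{a}, \mathbf{b}, \mathbf{c}];$

($3_3$)  $[\mathbf{a}, \mathbf{b}, \mathbf{c}] + [\mathbf{a}, \mathbf{b}, \mathbf{d}] = [\mathbf{a}, \mathbf{b}, \mathbf{c} + \mathbf{d}].$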
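From these we deduce that

$[\mathbf{a}, \mathbf{a}, \mathbf{c}] = \cdots = 0;$

$[\mathbf{a}, \mathbf{b}, \mathbf{c}] = [\mathbf{a} + \beta\mathbf{b}, \mathbf{b}, \mathbf{c}] = [\mathbf{a}, \mathbf{b}, \mathbf{c} + \alpha\mathbf{a}] = \cdots;$

and if $\mathbf{e}_1$, $\mathbf{e}_2$, $\mathbf{e}_3$ are linearly independent vectors in the space of $\mathbf{a}$, $\mathbf{b}$, and
$\mathbf{c}$, then

(2.03.3)  $[\mathbf{a}, \mathbf{b}, \mathbf{c}] = |a \;\; b \;\; c|\,[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3],$

where $|a \;\; b \;\; c|$ is called the determinant of the numerical vectors $a$, $b$,
and $c$, and its value will now be obtained. Note first that, if

$\mathbf{a}' = \alpha_1\mathbf{e}_1 + \alpha_2\mathbf{e}_2, \qquad \mathbf{b}' = \beta_1\mathbf{e}_1 + \beta_2\mathbf{e}_2,$

then

$[\mathbf{a}, \mathbf{b}, \mathbf{e}_3] = [\mathbf{a}', \mathbf{b}', \mathbf{e}_3] = (\alpha_1\beta_2 - \alpha_2\beta_1)[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3].$

In fact, the identical steps that led to (2.03.1) and (2.03.2) will, if applied
to the second member of this last equality, yield the third member.
Now by an obvious modification we obtain

$[\mathbf{a}, \mathbf{b}, \mathbf{e}_2] = (\alpha_3\beta_1 - \alpha_1\beta_3)[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3],$

and again

$[\mathbf{a}, \mathbf{b}, \mathbf{e}_1] = (\alpha_2\beta_3 - \alpha_3\beta_2)[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3].$

By putting these together we obtain

(2.03.4)  $|a \;\; b \;\; c| = \gamma_1(\alpha_2\beta_3 - \alpha_3\beta_2) + \gamma_2(\alpha_3\beta_1 - \alpha_1\beta_3) + \gamma_3(\alpha_1\beta_2 - \alpha_2\beta_1).$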

When we write the determinant explicitly in the form

$|a \;\; b \;\; c| = \begin{vmatrix} \alpha_1 & \beta_1 & \gamma_1 \\ \alpha_2 & \beta_2 & \gamma_2 \\ \alpha_3 & \beta_3 & \gamma_3 \end{vmatrix}$

we see that in the expansion of the determinant in terms of the $\gamma_i$, the
coefficient of each $\gamma_i$ is, except for sign, that second-order determinant
that remains after deleting the row and column containing $\gamma_i$. The sign
is that power of $-1$ whose exponent is obtained by adding together the
number of the row and column. Thus $\gamma_2$ is in the second row, third
column, and the sign is $(-1)^{2+3}$. The coefficient of $\gamma_i$ with its proper
sign is called the cofactor of $\gamma_i$. By interchanging rows and columns and
going through the same process, we find

$|a \;\; b \;\; c| = \alpha_1 A_1 + \alpha_2 A_2 + \alpha_3 A_3 = \beta_1 B_1 + \beta_2 B_2 + \beta_3 B_3,$

where the capital letters signify cofactors formed by the same rule. It is
also true that, for example,

$0 = |a \;\; b \;\; a| = \alpha_1\Gamma_1 + \alpha_2\Gamma_2 + \alpha_3\Gamma_3,$

and, in general, when the elements of any column are multiplied by the
cofactors of some other column and the products summed, the sum
vanishes.

Finally, we have the expansions

$|a \;\; b \;\; c| = \alpha_1 A_1 + \beta_1 B_1 + \gamma_1\Gamma_1 = \alpha_2 A_2 + \beta_2 B_2 + \gamma_2\Gamma_2 = \alpha_3 A_3 + \beta_3 B_3 + \gamma_3\Gamma_3.$

These are the expansions we should get if we were to rewrite the
determinant, writing the rows of the original as columns of the new one, and
these equations say in effect that this exchange of rows for columns leaves
the value of the determinant unaltered. It is quite clear that such an
exchange leaves the value of a second-order determinant unaltered. 
Hence it is clear that in the expansion of either third-order determinant, 
the original or the transposed, the coefficient of any element is the same. 
The theorem follows because, when the determinant is expressed as an 
explicit function of its elements, each term contains as a factor one and 
only one element from each row and one and only one element from each 
column. 
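
The cofactor rules for a third-order determinant can be tried on a small example. This is a sketch only; the matrix is arbitrary, the function names are illustrative, and rows and columns are numbered from 0, which does not affect the parity of the sign exponent.

```python
def minor(M, i, j):
    """The 2-by-2 array left after deleting row i and column j of a 3-by-3 M."""
    return [[M[r][c] for c in range(3) if c != j] for r in range(3) if r != i]

def det2(m):
    """Second-order determinant."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def cofactor(M, i, j):
    """Signed second-order determinant belonging to element (i, j)."""
    return (-1) ** (i + j) * det2(minor(M, i, j))

def expand_by_column(M, j):
    return sum(M[i][j] * cofactor(M, i, j) for i in range(3))

def expand_by_row(M, i):
    return sum(M[i][j] * cofactor(M, i, j) for j in range(3))

def alien_sum(M, j, k):
    """Elements of column j times cofactors of a different column k."""
    return sum(M[i][j] * cofactor(M, i, k) for i in range(3))

M = [[2, 0, 1],
     [1, 3, -1],
     [0, 5, 4]]

if __name__ == "__main__":
    print([expand_by_column(M, j) for j in range(3)])
    print([expand_by_row(M, i) for i in range(3)])
    print(alien_sum(M, 0, 2))
```

Every row and every column expansion gives the same value, and the "alien" sum vanishes, as stated above.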

The recursive extension to successively higher dimensions can be made 
by following the same pattern, and the formulas need not be written 
explicitly. For each extension the expansion is made first along a par- 
ticular column, and one observes that it is equally possible to expand 
along any row. 

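For determinants of order 4 or greater another type of expansion is
possible, called the Laplace expansion. To describe this it is convenient
to introduce the symbolism

$|\alpha_i \;\; \beta_j| = \begin{vmatrix} \alpha_i & \beta_i \\ \alpha_j & \beta_j \end{vmatrix}.$

If, now,

$\mathbf{a} = \alpha_1\mathbf{e}_1 + \alpha_2\mathbf{e}_2 + \alpha_3\mathbf{e}_3 + \alpha_4\mathbf{e}_4,$

$\cdots$

$\mathbf{d} = \delta_1\mathbf{e}_1 + \delta_2\mathbf{e}_2 + \delta_3\mathbf{e}_3 + \delta_4\mathbf{e}_4,$

then we can see that

$[\mathbf{a}, \mathbf{b}, \mathbf{e}_3, \mathbf{e}_4] = |\alpha_1 \;\; \beta_2|\,[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4],$

and

$[\mathbf{e}_1, \mathbf{e}_2, \mathbf{c}, \mathbf{d}] = |\gamma_3 \;\; \delta_4|\,[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4].$

Hence we shall find that in the expansion of $|a \;\; b \;\; c \;\; d|$ there will
appear terms which make up the product $|\alpha_1 \;\; \beta_2|\,|\gamma_3 \;\; \delta_4|$. But if we
interchange $\mathbf{e}_2$ and $\mathbf{e}_3$, say, we shall find that there are also terms making
up the product $|\alpha_1 \;\; \beta_3|\,|\gamma_2 \;\; \delta_4|$, and there is no term common to the
two products. When all possible interchanges are made that yield
distinct products, we find

$|a \;\; b \;\; c \;\; d| = |\alpha_1 \;\; \beta_2|\,|\gamma_3 \;\; \delta_4| - |\alpha_1 \;\; \beta_3|\,|\gamma_2 \;\; \delta_4| + |\alpha_1 \;\; \beta_4|\,|\gamma_2 \;\; \delta_3|$

$\qquad + |\alpha_2 \;\; \beta_3|\,|\gamma_1 \;\; \delta_4| - |\alpha_2 \;\; \beta_4|\,|\gamma_1 \;\; \delta_3| + |\alpha_3 \;\; \beta_4|\,|\gamma_1 \;\; \delta_2|.$

To determine the sign in each case we note, for example, that in the
expansion $|\alpha_3 \;\; \beta_4|\,|\gamma_1 \;\; \delta_2|$ would appear as the coefficient of $[\mathbf{e}_3, \mathbf{e}_4, \mathbf{e}_1, \mathbf{e}_2]$,
but that this is equal to $[\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4]$, whence the sign is plus.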
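
The Laplace expansion by the first two columns can be compared with an ordinary cofactor expansion. The sketch below is an illustration under the usual sign rule (the parity of the sum of the row and column numbers of the chosen minor); the function names and the test matrix are arbitrary, and indices start at 0.

```python
from itertools import combinations

def det(M):
    """Determinant by cofactor expansion along the first column."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for i in range(n):
        sub = [row[1:] for r, row in enumerate(M) if r != i]
        total += (-1) ** i * M[i][0] * det(sub)
    return total

def laplace_first_two_columns(M):
    """Fourth-order determinant as a signed sum of products of 2-by-2 minors."""
    total = 0
    for i, j in combinations(range(4), 2):
        k, l = [r for r in range(4) if r not in (i, j)]
        left = M[i][0] * M[j][1] - M[j][0] * M[i][1]     # rows i, j of columns a, b
        right = M[k][2] * M[l][3] - M[l][2] * M[k][3]    # remaining rows of columns c, d
        sign = (-1) ** (i + j + 1)    # parity of row plus column numbers, 0-based
        total += sign * left * right
    return total

M = [[1, 2, 0, 3],
     [4, 1, 2, 0],
     [0, 3, 1, 5],
     [2, 0, 4, 1]]

if __name__ == "__main__":
    print(det(M), laplace_first_two_columns(M))
```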

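Note finally that the determinant of the product of two matrices is the
product of the determinants. In fact, if

$\mathbf{e} = \mathbf{f}E, \qquad \mathbf{f} = \mathbf{g}F,$

then

$\mathbf{e} = \mathbf{g}FE.$

But on the one hand,

$[\mathbf{e}_1, \ldots, \mathbf{e}_n] = |E|\,[\mathbf{f}_1, \ldots, \mathbf{f}_n]$

$\qquad = |E|\,|F|\,[\mathbf{g}_1, \ldots, \mathbf{g}_n],$

and on the other hand,

$[\mathbf{e}_1, \ldots, \mathbf{e}_n] = |FE|\,[\mathbf{g}_1, \ldots, \mathbf{g}_n].$

Hence

$|FE| = |F|\,|E|.$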

2.04. Length and Orthogonality. Geometrically the scalar product
$\mathbf{x} \cdot \mathbf{y}$ of the vectors $\mathbf{x}$ and $\mathbf{y}$ is equal to the product of their lengths into
the cosine of the angle between them, or the product of the length of one
vector by the projection of the other upon it. It is clear geometrically
that the projection of a broken line upon a given line is the sum of the
projections of the separate segments and is equal also to the projection
of the single segment which joins the two ends of the broken line. Hence,
if $\mathbf{x}$ is given by (2.01.1), its projection upon $\mathbf{y}$ is the sum of the projections
of the segments $\xi_i\mathbf{e}_i$ upon $\mathbf{y}$. Hence

$\mathbf{x} \cdot \mathbf{y} = \sum_i \xi_i(\mathbf{e}_i \cdot \mathbf{y}).$

But by the same rule, if

$\mathbf{y} = \sum_j \eta_j\mathbf{e}_j,$

then

$\mathbf{e}_i \cdot \mathbf{y} = \sum_j \eta_j(\mathbf{e}_i \cdot \mathbf{e}_j),$

and therefore

(2.04.1)  $\mathbf{x} \cdot \mathbf{y} = \sum_i \sum_j \xi_i\eta_j(\mathbf{e}_i \cdot \mathbf{e}_j).$

Now the scalar products

(2.04.2)  $\gamma_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j = \gamma_{ji}$

are known once the vectors $\mathbf{e}_i$ are themselves known. Hence the scalar
product of any two vectors can be calculated from (2.04.1).

If each column of a matrix $M$ is written as a row, the order remaining
the same, the resulting matrix is known as the transpose of the original
and is designated $M^T$. In particular, if $x$ is the column vector of the $\xi_i$,
$x^T$ is the row vector of the $\xi_i$. With this understanding, if $G$ is the matrix

(2.04.3)  $G = (\gamma_{ij}) = \mathbf{e}^T\mathbf{e},$

this is said to define the metric in the space, and Eq. (2.04.1) becomes

(2.04.4)  $\mathbf{x} \cdot \mathbf{y} = x^T G y = y^T G x.$

The matrix $G$ is equal to its own transpose and is said to be symmetric.
When the metric $G$ is known and fixed throughout the discussion, one
often uses the notation

(2.04.5)  $(x, y) = (y, x) = x^T G y.$

An often-used type of coordinate system is one in which each reference
vector $\mathbf{e}_i$ is of unit length and orthogonal to all the others. In this case

$G = I, \qquad \mathbf{x} \cdot \mathbf{y} = x^T y = y^T x.$

With reference to the coordinate system $\mathbf{f}$ given by (2.01.14), the
metric is defined by

$H = \mathbf{f}^T\mathbf{f}.$

Now

$G = \mathbf{e}^T\mathbf{e} = E^T\mathbf{f}^T\mathbf{f}E,$

(2.04.6)  $G = E^T H E.$

This relates the metrics for the two coordinate systems.
If both $\mathbf{e}$ and $\mathbf{f}$ are unit-orthogonal systems, then

$G = H = I$

and hence

$E^T E = I.$

In this case

$E^T = E^{-1},$

and the matrix $E$ is said to be orthogonal.

When $G$ is not the identity matrix, then in the process of evaluating a
scalar product of two vectors $\mathbf{x}$ and $\mathbf{y}$, given $x$ and $y$, one must form either
$Gx$ or $Gy$. Since if one knows the vector

(2.04.7)  $x' = Gx$

one can always find $x$ by solving the equations, knowing $x'$ is (in principle)
equivalent to knowing $x$. Hence $x'$ is also a representation of $\mathbf{x}$, but of
a different kind from $x$. It is customary to speak of $x'$ as giving the
covariant representation or as being a covariant vector, and of $x$ as giving
the contravariant representation or as being a contravariant vector. If
$y'$ is the covariant representation of $\mathbf{y}$, then

$\mathbf{x} \cdot \mathbf{y} = x^T y' = x'^T y = y^T x' = y'^T x.$

In case $\mathbf{x}$ and $\mathbf{y}$ are the same vector, the scalar product is the square of
the length:

$\mathbf{x} \cdot \mathbf{x} = |\mathbf{x}|^2 = x^T G x.$

Since any numerical vector $x$ represents some geometric vector $\mathbf{x}$, it
follows that always

$x^T G x \ge 0,$

and the equality can hold only when $x$ is the null vector, $x = 0$. By
virtue of this property of the matrix $G$, it is said to be positive definite.
The vectors $\mathbf{e}$ can always be referred to a unit-orthogonal system $\mathbf{f}$, and
in this case (2.04.6) becomes

$G = E^T E,$

since $H$ is then the identity. It will be shown in §2.201 that any positive
definite matrix can be expressed as a product of a matrix by its transpose.
Hence any positive definite matrix represents a metric in some coordinate
system.
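
A short numerical illustration of the metric follows. It is a sketch which assumes that the columns of $E$ give the reference vectors $\mathbf{e}_i$ in a unit-orthogonal system $\mathbf{f}$; the particular numbers and helper names are arbitrary.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    return [sum(row[k] * x[k] for k in range(len(x))) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

E = [[1, 1],
     [0, 2]]                      # e = fE, with f unit-orthogonal (assumed)
G = matmul(transpose(E), E)       # metric of the e system, G = E^T E

x = [3, -1]                       # contravariant components in the e system
y = [2, 5]

metric_product = dot(x, matvec(G, y))                 # x^T G y
direct_product = dot(matvec(E, x), matvec(E, y))      # same vectors referred to f
x_cov = matvec(G, x)                                  # covariant representation x' = Gx

if __name__ == "__main__":
    print(G, metric_product, direct_product, dot(x_cov, y))
```

The scalar product computed from the metric agrees with the ordinary product in the unit-orthogonal system, and the covariant components give it without a further multiplication by $G$.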

There are always several geometric settings, any one of which could 
give rise to a given set of linear algebraic equations. Hence given the 
equations, we are at liberty to associate any geometric picture that seems 
convenient. This will be done from time to time in the presentation 
of the various methods for solving linear systems. The geometric 
vectors e, f, etc., seldom if ever need to be introduced explicitly. Given 
a numerical vector $x$, it is sufficient to know that, given a coordinate
system $\mathbf{e}$, the numerical vector $x$ defines a geometric vector $\mathbf{x}$ by the
relation

$\mathbf{x} = \mathbf{e}x.$

Hence we shall often speak of x as though it were itself a geometric vector, 
and we shall refer to it simply as a vector, without qualification. 

2.05. Rank and Nullity; Adjoint and Reciprocal Matrices. The outer 
product of two vectors has been interpreted geometrically as representing 
an oriented parallelogram; the outer product of r vectors represents an 
oriented parallelepiped of r dimensions. We shall take it as geometrically 
evident that the outer product of r vectors can vanish if and only if the 
vectors lie in a space of dimension less than r, or in other words, if they 
are linearly dependent. 
A matrix is said to have rank $r$ in case it has $r$ linearly independent
columns, but every $r + 1$ columns are linearly dependent. Hence every
column is expressible as a linear combination of these $r$ columns. The
outer product of the vectors represented by these $r$ columns must be
non-null, so that at least one submatrix formed from these $r$ columns must
have a nonvanishing determinant. On the other hand, no submatrix of
order $r + 1$ can have a nonvanishing determinant. Hence a matrix is
of rank $r$ if and only if the largest submatrix with nonvanishing
determinant is of order $r$. By applying the above argument to the transpose
of the matrix, everything that has been said of the columns applies
equally to the rows.

A square matrix of order $n$ and rank $r$ is said to have nullity $n - r$.
If we suppose the coordinate vectors $\mathbf{e}$ to be a unitary orthogonal system,
the homogeneous equations

$A^T x = 0$

are satisfied by any vector $x$ orthogonal to all the columns of $A$. But if $A$
has rank $r$, its columns determine an $r$-dimensional subspace of the
$n$-dimensional vector space, and there is an $(n - r)$-dimensional subspace
orthogonal to it. Hence the equations have $n - r$ linearly independent
solutions.

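A matrix is nonsingular in case its determinant is nonzero. If

(2.05.1)  $\mathbf{x} = \sum_i \xi_i\mathbf{a}_i = \mathbf{a}x$

and the vectors $\mathbf{a}_i$ are linearly independent, then

$\xi_1[\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n] = [\xi_1\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$

$\qquad = [\xi_1\mathbf{a}_1 + \xi_2\mathbf{a}_2 + \cdots + \xi_n\mathbf{a}_n, \mathbf{a}_2, \ldots, \mathbf{a}_n]$

$\qquad = [\mathbf{x}, \mathbf{a}_2, \ldots, \mathbf{a}_n].$

Hence if

$\mathbf{a} = \mathbf{e}A,$

these equations give

(2.05.2)  $\xi_1|a_1 \;\; a_2 \;\; \ldots \;\; a_n| = |x \;\; a_2 \;\; \ldots \;\; a_n|$

when we drop the outer product of the $\mathbf{e}_i$. Since the vectors $\mathbf{a}_i$ are linearly
independent, the matrix $A$ is nonsingular, and hence

(2.05.3)  $\xi_1 = |x \;\; a_2 \;\; \ldots \;\; a_n| \,/\, |a_1 \;\; a_2 \;\; \ldots \;\; a_n|,$

and, in general, each $\xi_i$ is the quotient by $|A|$ of the determinant obtained
from $|A|$ when $x$ replaces $a_i$. This is Cramer's rule.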
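
Cramer's rule translates almost literally into code. The following is a sketch in exact rational arithmetic for small systems; the function names and the example are illustrative, and the determinant is evaluated by cofactor expansion, which is practical only for low orders.

```python
from fractions import Fraction

def det(M):
    """Determinant by cofactor expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        sub = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(sub)
    return total

def cramer(A, x):
    """Solve A xi = x: xi_i is |A with column i replaced by x| divided by |A|."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(A)
    solution = []
    for i in range(n):
        Ai = [row[:i] + [x[r]] + row[i + 1:] for r, row in enumerate(A)]
        solution.append(Fraction(det(Ai), d))
    return solution

A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]
x = [0, -2, 10]           # equals A applied to (1, -2, 3)

if __name__ == "__main__":
    print(cramer(A, x))
```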
If we write

(2.05.4)  $A = (\alpha_{ij}),$

and if $A_{ij}$ is the cofactor of $\alpha_{ij}$ in $|A|$, the matrix

(2.05.5)  $\mathrm{adj}(A) = (A_{ji})$

of the cofactors is called the adjoint of $A$. The expansion rules illustrated
in §2.03 for a determinant of order 3 can be expressed

(2.05.6)  $A\,\mathrm{adj}(A) = \mathrm{adj}(A)\,A = |A|\,I;$

the product of a matrix by its adjoint in either order is a matrix with
0 everywhere except along the principal diagonal, and there every element
has the value $|A|$.
Now let

(2.05.7)  $\alpha^{ij} = A_{ji}/|A|.$

Then

$(\alpha^{ij}) = |A|^{-1}\,\mathrm{adj}(A),$

$(\alpha_{ij})(\alpha^{ij}) = I,$

so that

(2.05.8)  $A^{-1} = |A|^{-1}\,\mathrm{adj}(A).$

This gives the explicit representation of the elements of the inverse
matrix.
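
Relations (2.05.6) and (2.05.8) can be verified on a small matrix. The sketch below is illustrative only: the names `adjugate` and `inverse` and the example are not from the text, and the method is not offered as a practical way of inverting matrices of any size.

```python
from fractions import Fraction

def minor(M, i, j):
    """M with row i and column j deleted."""
    n = len(M)
    return [[M[r][c] for c in range(n) if c != j] for r in range(n) if r != i]

def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det(minor(M, 0, j)) for j in range(len(M)))

def adjugate(M):
    """Adjoint: element (i, j) is the cofactor of element (j, i) of M."""
    n = len(M)
    return [[(-1) ** (i + j) * det(minor(M, j, i)) for j in range(n)]
            for i in range(n)]

def inverse(M):
    """A^{-1} = adj(A) / |A|, in exact arithmetic."""
    d = det(M)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(a, d) for a in row] for row in adjugate(M)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]

if __name__ == "__main__":
    print(matmul(A, adjugate(A)))
    print(inverse(A))
```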

2.051. Projection operators. Let $\mathbf{a}$ represent a set of $m < n$ linearly
independent vectors, and

(2.051.1)  $\mathbf{a} = \mathbf{e}A.$

These vectors form a basis for a subspace of $m$ dimensions. If $\mathbf{e}^T\mathbf{e} = I$,
then

$\mathbf{a}^T\mathbf{a} = A^T\mathbf{e}^T\mathbf{e}A = A^T A$

defines the metric for that subspace. The matrix is nonsingular, since
otherwise a non-null vector $x$ would exist satisfying $A^T A x = 0$, and hence
$x^T A^T A x = 0$, and the non-null geometric vector $\mathbf{a}x$ would have length
zero, which is impossible.
Hence the symmetric matrix

(2.051.2)  $P = A(A^T A)^{-1}A^T$

exists, and is said to be idempotent since

$P^2 = A(A^T A)^{-1}A^T A(A^T A)^{-1}A^T = A(A^T A)^{-1}A^T = P.$

If $\mathbf{x} = \mathbf{e}x$ is any vector, then

$\mathbf{e}Px = \mathbf{e}A(A^T A)^{-1}A^T x = \mathbf{a}(A^T A)^{-1}A^T x$

is a vector in the space of $\mathbf{a}$. If $\mathbf{y} = \mathbf{a}y = \mathbf{e}Ay$ is any vector in the space
of $\mathbf{a}$, then

$\mathbf{e}P(Ay) = \mathbf{e}A(A^T A)^{-1}A^T(Ay) = \mathbf{e}Ay = \mathbf{y}.$

Hence if $x$ represents any vector, $Px$ represents a vector in the space of $A$,
and if $x = Ay$ represents a vector in the space of $A$, then $Px$ represents
the same vector. Thus the matrix $P$ projects any vector into a particular
subspace and leaves unchanged any vector already in that subspace. It
is therefore called a projection operator.

The projection is, indeed, an orthogonal projection. For given any
$x$, the projection is $Px$, and the residual is $x - Px$. But since $P$ is
symmetric, the projection and the residual have the scalar product

$(Px)^T(x - Px) = x^T P(I - P)x = x^T(P - P^2)x = 0,$

since $P$ is idempotent.

Any symmetric idempotent matrix $P$ is a projection operator. For
if $P$ has rank $m$, we can find a matrix $A$ of $m$ linearly independent columns
such that every column of $P$ is a linear combination of columns of $A$.
Hence, for some matrix $B$ we can write

$P = AB^T.$

We have only to show that

$B = A(A^T A)^{-1}.$

Since $P$ is idempotent,

$AB^T = P = P^2 = AB^T AB^T.$

Since the columns of $A$ are linearly independent, $B^T AB^T = B^T$. The
rank of $P = AB^T$ cannot exceed the rank of $B^T$; hence $B^T$ has rank at least
$m$, and having only $m$ rows, the rank is exactly $m$. Hence

$B^T A = I = A^T B.$

Since $P$ is symmetric,

$AB^T = BA^T,$

$AB^T A = BA^T A,$

$A = BA^T A.$

This is the desired result.

For an arbitrary metric, $\mathbf{e}^T\mathbf{e} = G$, the orthogonal projection is
represented by a matrix

(2.051.3)  $P = A(A^T G A)^{-1}A^T G,$

where the columns of $A$ are contravariant vectors.
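
The properties of the projection operator (2.051.2) can be checked on a small example with the identity metric. This sketch uses exact rational arithmetic; the matrix $A$, the vectors, and the helper names are arbitrary illustrative choices.

```python
from fractions import Fraction

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    return [sum(row[k] * x[k] for k in range(len(x))) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inv2(M):
    """Inverse of a nonsingular 2-by-2 matrix, in exact arithmetic."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1, 0],
     [1, 1],
     [0, 1]]                    # two independent columns in 3-space
At = transpose(A)
P = matmul(matmul(A, inv2(matmul(At, A))), At)    # P = A (A^T A)^{-1} A^T

x = [1, 2, 3]
projection = matvec(P, x)
residual = [xi - pi for xi, pi in zip(x, projection)]

if __name__ == "__main__":
    print(P)
    print(projection, residual, dot(projection, residual))
```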

2.06. Cayley-Hamilton Theorem; Canonical Form of a Matrix. If with
any vector $x$ we associate its successive transforms

(2.06.1)  $x_i = T^i x \qquad (i = 0, 1, 2, \ldots),$

at most $n$ vectors of the sequence $x_0, x_1, x_2, \ldots$ will be linearly
independent. Hence for some $r \le n$ there exist scalars $\gamma_0, \gamma_1, \ldots, \gamma_r$ not
all null such that

(2.06.2)  $(\gamma_0 I + \gamma_1 T + \cdots + \gamma_r T^r)x = 0.$

The matrix

(2.06.3)  $\psi(T) = \gamma_0 I + \gamma_1 T + \cdots + \gamma_r T^r$

is a polynomial in $T$, and since

$\psi(T)x = 0,$

$x$ is said to lie in the null space of $\psi(T)$. If both $x$ and $y$ are in the null
space of $\psi(T)$, then $\alpha x + \beta y$ is also in this null space for any scalars $\alpha$ and
$\beta$. The null space consists of the null vector only, unless $\psi(T)$ is a
singular matrix. If it is nonsingular, we shall say it has no null space,
disregarding the trivial case of the null vector alone.

If $\phi(T)$ and $\psi(T)$ are polynomials in $T$, and if there is a non-null vector
$x$ which lies in the null space of each, then the scalar polynomials $\phi(\lambda)$
and $\psi(\lambda)$ have a common divisor which is not constant. For if they do
not, then by a classical theorem in algebra (proved below in §3.06), there
exist polynomials $f(\lambda)$ and $g(\lambda)$ such that

(2.06.4)  $f(\lambda)\phi(\lambda) + g(\lambda)\psi(\lambda) = 1$

identically. But then

(2.06.5)  $f(T)\phi(T) + g(T)\psi(T) = I,$

and hence

(2.06.6)  $f(T)\phi(T)x + g(T)\psi(T)x = x.$

But since $x$ lies in the null spaces of both $\phi$ and $\psi$, the left member of this
identity vanishes, and hence $x$ is a null vector, contrary to supposition.

Next to a constant, whose null space is the null vector alone, the
simplest type of polynomial is linear. Hence consider a polynomial
$T - \lambda I$. It has a null space if and only if its determinant vanishes:

(2.06.7)  $|T - \lambda I| = 0.$

This determinant, when expanded, is a polynomial in $\lambda$ of degree $n$,

(2.06.8)  $\phi(\lambda) = |T - \lambda I| = (-\lambda)^n + \gamma_1(-\lambda)^{n-1} + \cdots + \gamma_n,$

called the characteristic function of $T$. Equation (2.06.7) is called the
characteristic equation of $T$. One verifies easily that

(2.06.9)  $\gamma_1 = \sum_i \tau_{ii} = t(T), \qquad \gamma_n = |T|,$

where $t(T)$ is called the trace of $T$ and is equal to the sum of the diagonal
elements.

Any $\lambda$ satisfying (2.06.7) is called a proper value of $T$. If $\lambda$ is any
proper value, there is at least one vector in the null space of $T - \lambda I$,
and any vector in the null space of $T - \lambda I$ is called a proper vector
associated with the proper value $\lambda$. If $\lambda$ is not a proper value, $T - \lambda I$
has no null space.
For any proper value $\lambda$, any vector in the null space of $T - \lambda I$ is also
in the null space of $(T - \lambda I)^r$ for any positive integer $r$, but the converse
is not true. A vector in the null space of $(T - \lambda I)^r$ for any positive
integer $r$ is a principal vector. If $x$ is in the null space of $(T - \lambda I)^r$
but not in the null space of $(T - \lambda I)^{r-1}$, it is called a principal vector of
grade $r$.

The characteristic function $\phi(\lambda)$ has the remarkable property that

(2.06.10)  $\phi(T) = 0.$

This is the Cayley-Hamilton theorem, which can be stated otherwise by
saying that the null space of $\phi(T)$ is the entire space. This might be
expected from the fact, shown above, that the null spaces of two
polynomials in $T$ have a non-null vector in common only if they have a
common divisor. A proof of the Cayley-Hamilton theorem is as follows.
Since, for every positive integer $r$,

$T^r - \lambda^r I = (T - \lambda I)(T^{r-1} + \lambda T^{r-2} + \cdots + \lambda^{r-1}I),$

the difference $\phi(T) - \phi(\lambda)I$ is equal to a polynomial in $T$ multiplied by
$T - \lambda I$, and hence is said to be divisible by $T - \lambda I$. Also, by Eq.
(2.05.6),

$(T - \lambda I)\,\mathrm{adj}(T - \lambda I) = \phi(\lambda)I.$

Hence $\phi(\lambda)I$ is also divisible by $T - \lambda I$. It follows, then, that $\phi(T)$ is
divisible by $T - \lambda I$. However, $\phi(T)$ is independent of $\lambda$ and must
therefore vanish.
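
The theorem is easily checked for a matrix of order 3. The sketch below forms the coefficients of (2.06.8) for $n = 3$, taking $\gamma_1$ as the trace, $\gamma_3$ as the determinant, and $\gamma_2$ as the sum of the second-order principal minors (a standard fact assumed here, not derived in the text), and then evaluates $\phi(T)$. The function names and the example matrix are illustrative.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def char_coefficients(T):
    """gamma_1, gamma_2, gamma_3 of |T - lambda I| for a 3-by-3 matrix."""
    g1 = T[0][0] + T[1][1] + T[2][2]
    g2 = sum(T[i][i] * T[j][j] - T[i][j] * T[j][i]
             for i in range(3) for j in range(i + 1, 3))
    g3 = det3(T)
    return g1, g2, g3

def phi_of_matrix(T):
    """phi(T) = (-T)^3 + gamma_1 (-T)^2 + gamma_2 (-T) + gamma_3 I."""
    g1, g2, g3 = char_coefficients(T)
    N = [[-t for t in row] for row in T]
    N2 = matmul(N, N)
    N3 = matmul(N2, N)
    return [[N3[i][j] + g1 * N2[i][j] + g2 * N[i][j] + (g3 if i == j else 0)
             for j in range(3)] for i in range(3)]

T = [[2, 1, 0],
     [0, 3, 1],
     [1, 0, 1]]

if __name__ == "__main__":
    print(char_coefficients(T))
    print(phi_of_matrix(T))
```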

If $F$ is any nonsingular matrix, then the matrices $T$ and

(2.06.11)  $T' = F^{-1}TF$

are said to be similar. They represent the same transformation but in
different coordinate systems. Since

$F^{-1}(T - \lambda I)F = T' - \lambda I,$

they have the same characteristic function and hence the same proper
values. One proves inductively that for any positive integer $r$

$T'^r = F^{-1}T^r F,$

and hence that

(2.06.12)  $\psi(T') = F^{-1}\psi(T)F$

for any polynomial $\psi$.

It is reasonable to expect that a given transformation might be more simply represented in some coordinate systems than in others, and this will now be shown. Note first that the theorem expressed by (2.06.4) can be generalized as follows: If there is no nonconstant factor common to all polynomials $\phi_1(\lambda), \phi_2(\lambda), \ldots, \phi_m(\lambda)$, then there exist polynomials $f_1(\lambda), f_2(\lambda), \ldots, f_m(\lambda)$ such that

(2.06.13)  $f_1(\lambda)\phi_1(\lambda) + \cdots + f_m(\lambda)\phi_m(\lambda) = 1$

identically. The proof is inductive and sufficiently indicated by taking $m = 3$. It may be that $\phi_1$ and $\phi_2$ have a common factor $d_{12}(\lambda)$, but if so, it is prime to $\phi_3$. By applying (2.06.4) to $\phi = \phi_1/d_{12}$ and $\psi = \phi_2/d_{12}$ and then multiplying by $d_{12}$, we have for some $g_1$ and $g_2$

$g_1\phi_1 + g_2\phi_2 = d_{12}.$

Also for some $g$ and $h$,

$g d_{12} + h\phi_3 = 1.$

Hence

$g g_1\phi_1 + g g_2\phi_2 + h\phi_3 = 1.$

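Now let the characteristic function $\phi(\lambda)$ be factored completely:

$\phi(\lambda) = (\lambda - \lambda_1)^{n_1}(\lambda - \lambda_2)^{n_2} \cdots (\lambda - \lambda_m)^{n_m},$

where

$n_1 + n_2 + \cdots + n_m = n,$

and the $\lambda_i$ are all different. Clearly the polynomials

$\phi_i(\lambda) = \phi(\lambda)(\lambda - \lambda_i)^{-n_i}$

satisfy the conditions of the theorem. Hence with suitable polynomials $f_i$, (2.06.13) can be satisfied, and hence

$f_1(T)\phi_1(T) + \cdots + f_m(T)\phi_m(T) = I.$

Hence for any vector $x$

$x = f_1(T)\phi_1(T)x + \cdots + f_m(T)\phi_m(T)x.$

But $\phi_i(T)x$, and hence also $f_i(T)\phi_i(T)x$, is a principal vector since

$(T - \lambda_i I)^{n_i}\phi_i(T) = \phi(T) = 0.$

Hence any vector is expressible as a sum of principal vectors.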

Moreover, the representation of $x$ as a sum of principal vectors is unique. For if not, the difference of distinct representations would express the null vector as a sum of principal vectors in the form

$y_1 + y_2 + \cdots + y_m = 0,$

where $y_i$ is a principal vector associated with $\lambda_i$, and at least one $y_i$ is non-null. Suppose $y_1 \neq 0$. Then

$\phi_1(T)(y_1 + y_2 + \cdots + y_m) = 0.$

But since $\phi_1(T)$ contains every factor $(T - \lambda_i I)^{n_i}$ except $(T - \lambda_1 I)^{n_1}$, it follows that

$\phi_1(T)(y_2 + \cdots + y_m) = 0.$

Hence

$\phi_1(T)y_1 = 0.$

But

$(T - \lambda_1 I)^{n_1}y_1 = 0,$

whereas $(\lambda - \lambda_1)^{n_1}$ and $\phi_1(\lambda)$ have no common factor. Hence $y_1 = 0$, contrary to supposition.
Now for a new coordinate system choose a matrix

(2.06.14)  $F = (F_1\ F_2\ \cdots\ F_m)$

in which the columns of $F_1$ form a coordinate system for the null space of $(T - \lambda_1 I)^{n_1}$, the columns of $F_2$ form a coordinate system for the null space of $(T - \lambda_2 I)^{n_2}$, .... If $x$ is any vector in the null space of $(T - \lambda_i I)^{n_i}$, so also is $Tx$ since

$(T - \lambda_i I)^{n_i}Tx = T(T - \lambda_i I)^{n_i}x = 0.$

Hence any column of $TF_i$ is expressible as a linear combination of columns of $F_i$, and therefore

(2.06.15)  $TF = FT'$

where $T'$ has the form

(2.06.16)  $T' = \begin{pmatrix} T'_1 & 0 & \cdots & 0 \\ 0 & T'_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & T'_m \end{pmatrix}.$

But $F$ must be nonsingular, whence

(2.06.17)  $F^{-1}TF = T',$

and except for the trivial case $m = 1$, a partial simplification has been effected. We proceed next to specialize the choice of the columns within any $F_i$.

These are all principal vectors, and there is some $\nu_i \le n_i$ such that every vector in the space of $F_i$ is of grade $\nu_i$ or less. First select a maximal linearly independent set of proper vectors in the space of $F_i$ (corresponding to the proper value $\lambda_i$). Adjoin to these a maximal linearly independent set of (principal) vectors of grade 2, . . . , and finally complete $F_i$ by adjoining vectors of maximal grade $\nu_i$. Thus $F_i$ has the form

$F_i = (F_{i1}, F_{i2}, \ldots, F_{i\nu_i}),$

where all columns of $F_i$ are linearly independent and where every column of $F_{ij}$ is a principal vector of grade $j$.

Now the columns of $F_{i\nu_i}$ are of grade $\nu_i$, while those of

$(T - \lambda_i I)F_{i\nu_i}$

are of grade $\nu_i - 1$. Furthermore, the columns of

$((T - \lambda_i I)F_{i\nu_i},\ F_{i\nu_i})$

are linearly independent. If this were not the case, there would exist equal linear combinations $(T - \lambda_i I)x$ of the columns of the first submatrix and $y$ of the columns of the second:

(2.06.18)  $(T - \lambda_i I)x = y.$

But then

$(T - \lambda_i I)^{\nu_i}x = (T - \lambda_i I)^{\nu_i - 1}y = 0$

since all columns of $F_i$ are of grade $\le \nu_i$. But then $y$ is of grade at most $\nu_i - 1$. Hence $y$, which is by definition a linear combination of columns of $F_{i\nu_i}$, is also a linear combination of the remaining columns of $F_i$, and this is another way of saying that the columns of $F_i$ are linearly dependent. Since this is not the case, $y = 0$. Hence by (2.06.18) $x$ is a proper vector, and hence both a linear combination of columns of $F_{i1}$ and of $F_{i\nu_i}$. Hence $x = 0$.

The argument may be continued to show that the columns of

(2.06.19)  $((T - \lambda_i I)^{\nu_i - 1}F_{i\nu_i},\ \ldots,\ (T - \lambda_i I)F_{i\nu_i},\ F_{i\nu_i})$

are linearly independent. If there are more columns in $F_{i1}$ than in $(T - \lambda_i I)^{\nu_i - 1}F_{i\nu_i}$, we adjoin additional ones to form a new matrix $F_{i1}$ and continue to form a new matrix $F_i$ of which the matrix (2.06.19) is a submatrix. We then show that the columns of

$((T - \lambda_i I)^{\nu_i - 2}F_{i,\nu_i - 1},\ \ldots,\ F_{i,\nu_i - 1},\ F_{i\nu_i})$

are linearly independent. Proceeding as before, we obtain finally a matrix $F_i$ such that $(T - \lambda_i I)F_{i\nu_i}$ is a submatrix of $F_{i,\nu_i - 1}$, . . . , and $(T - \lambda_i I)F_{i2}$ is a submatrix of $F_{i1}$.

Finally, the columns of $F_i$ are rearranged as follows: For $f_1$ take any column of $F_{i\nu_i}$. Let

$f_2 = (T - \lambda_i I)f_1,$

$f_3 = (T - \lambda_i I)f_2,$

. . . .

Then

$0 = (T - \lambda_i I)f_{\nu_i},$

and every vector of the sequence is a vector in $F_i$. If there are other columns in $F_{i\nu_i}$, take one of them as $f_{\nu_i + 1}$ and proceed as before. When all columns of $F_{i\nu_i}$ are exhausted, pass next to a column (if any) from $F_{i,\nu_i - 1}$ which does not appear in one of the above chains; then pass to $F_{i,\nu_i - 2}$, ....

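By so forming and rearranging the matrices $F_i$ which make up $F$, we obtain a matrix $F$ whose columns are grouped into sequences such that, when the double subscripts of the $f$'s are replaced by single subscripts from 1 to $n$, we have either

$Tf_j = \lambda f_j + f_{j+1}$

or else

$Tf_j = \lambda f_j$

for some $\lambda$. Hence

$TF = FT',$

where now $T'$ has the form

(2.06.20)  $T' = \begin{pmatrix} T'_1 & 0 & \cdots \\ 0 & T'_2 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix},$

and each $T'_i$ is either a scalar $\lambda_i$ or else has the form

(2.06.21)  $T'_i = \lambda_i I + I_1,$

(2.06.22)  $I_1 = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}.$

The matrix $I_1$ is called the auxiliary unit matrix, and has units along the first subdiagonal and elsewhere has zeros. Note that $I_1^2$ has units along the second subdiagonal, and if $I_1$ is of order $\nu$, then $I_1^\nu = 0$.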
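
The stated properties of the auxiliary unit matrix are easily verified by direct multiplication. The sketch below assumes an order of 4 purely for illustration; the function names `auxiliary_unit`, `matmul`, and `power` are introduced here and do not come from the text.

```python
def auxiliary_unit(order):
    """Matrix with units on the first subdiagonal and zeros elsewhere."""
    return [[1 if i == j + 1 else 0 for j in range(order)]
            for i in range(order)]

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power(A, r):
    """A raised to the non-negative integer power r."""
    n = len(A)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = matmul(result, A)
    return result

I1 = auxiliary_unit(4)          # order 4 is an arbitrary choice
I1_squared = power(I1, 2)       # units move to the second subdiagonal
I1_fourth = power(I1, 4)        # the power equal to the order vanishes

for row in I1_squared:
    print(row)
print(I1_fourth)
```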
Every column of $F$ is a principal vector of $T$. We could apply the above theorem to $T^T$ and in the process obtain a matrix $G$ every column of which is a principal vector of $T^T$. If $f$ is a principal vector of $T$ corresponding to the proper value $\lambda$, and if $g$ is a principal vector of $T^T$ corresponding to the proper value $\mu \neq \lambda$, then $g$ and $f$ satisfy

(2.06.23)  $g^T f = f^T g = 0.$

The proof can be made inductively. Suppose first that $g$ and $f$ are proper vectors. Then

$g^T T = \mu g^T, \qquad Tf = \lambda f,$

so that

$g^T T f = \mu g^T f = \lambda g^T f.$

Since $\lambda \neq \mu$, this proves the relation for that case. Next suppose $g$ is a proper vector but $f$ of grade 2. Hence

$(T - \lambda I)f = f_1 \neq 0,$

but

$(T - \lambda I)f_1 = 0.$

Hence $f_1$ is a proper vector. Now

$g^T f_1 = 0,$

as was just shown. Hence

$g^T(T - \lambda I)f = g^T f_1 = 0,$

whereas

$g^T T f = \mu g^T f,$

and again, since $\lambda \neq \mu$, the relation is proved. By continuing one proves the relation for principal vectors $g$ and $f$ of any grade.

If $T$ is symmetric, $T^T = T$, then any principal vector of $T^T$ is also a principal vector of $T$. But for a symmetric matrix we now show that all principal vectors are proper vectors, and in the normalized form (2.06.20) of $T$ all matrices $T'_i$ are scalars.

This is clearly the case when the proper values are all distinct. In that case, in fact, every proper vector is orthogonal to every other proper vector, whence $F^T F$ is a diagonal matrix, and by choosing every vector $f$ to be of unit length, one has even

(2.06.24)  $F^T F = I$

so that $F$ is an orthogonal matrix. Suppose the proper values of $T$ are not all distinct. One can, nevertheless, vary the elements of $T$ slightly so that the matrix $T + \delta T$ is still symmetric and has all proper values distinct. Then $F + \delta F$ is an orthogonal matrix. As the elements of $T + \delta T$ vary continuously while the matrix remains symmetric, the columns $f + \delta f$ of $F + \delta F$ also vary continuously but remain mutually orthogonal and can be held at unit length. Hence these properties remain while $\delta T$ vanishes. Hence for any symmetric matrix $T$ there exists an orthogonal matrix $F$ such that

(2.06.25)  $F^T T F = \Lambda,$

where $\Lambda$ is a diagonal matrix whose elements are the proper values of $T$.
2.07. Analytic Functions of a Matrix; Convergence. The relation (2.06.12), valid for any polynomial, is easily extended. Consider first any of the matrices $T'_i$ of (2.06.20), neglecting the trivial case when $T'_i$ is a scalar. Any power can be written

(2.07.1)  $T'^r_i = (\lambda_i I + I_1)^r = \lambda_i^r I + r\lambda_i^{r-1} I_1 + \binom{r}{2}\lambda_i^{r-2} I_1^2 + \cdots,$

and if $T'_i$ is of order $\nu$, there are at most $\nu$ terms for any integer $r$. If $|\lambda_i| < 1$, then for any fixed $s$

$\lim_{r \to \infty} \binom{r}{s}\lambda_i^{r-s} = 0.$

Hence in this case as $r$ becomes infinite, $T'^r_i$ approaches the null matrix, in the sense that every one of its elements approaches zero. If for every proper value $\lambda_i$ of $T$ it is true that $|\lambda_i| < 1$, then also $T'^r$ approaches the null matrix as $r$ becomes infinite. Since $F$ and $F^{-1}$ are fixed, this is true also of $F T'^r F^{-1}$, and hence of $T^r$. Hence if every proper value of $T$ has modulus less than unity, then $T^r \to 0$ as $r$ becomes infinite. This condition is necessary as well as sufficient.
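
A small numerical experiment makes the behavior of the powers concrete. The sketch below is illustrative only: it assumes a single block $\lambda I + I_1$ of order 3 with $\lambda = 0.9$, and the helper names are introduced here. It shows that the largest element of the power may grow at first, because of the binomial coefficients in (2.07.1), before it finally decays.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def block(lam, order):
    """lam*I + I_1: lam on the diagonal, units on the first subdiagonal."""
    return [[lam if i == j else (1.0 if i == j + 1 else 0.0)
             for j in range(order)] for i in range(order)]

def power(A, r):
    """A raised to the non-negative integer power r."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = matmul(result, A)
    return result

def largest_element(A):
    """Magnitude of the numerically largest element."""
    return max(abs(v) for row in A for v in row)

T = block(0.9, 3)               # proper value 0.9 (an assumed example)
for r in (1, 10, 50, 300):
    print(r, largest_element(power(T, r)))
```

With $\lambda = 1$ in place of 0.9 the off-diagonal elements grow without bound, in agreement with the necessity of the condition.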

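Now consider any function $\psi(\lambda)$ analytic at the origin:

$\psi(\lambda) = \psi(0) + \psi'(0)\lambda + \tfrac{1}{2}\psi''(0)\lambda^2 + \cdots.$

If $\lambda_i$ lies within the circle of convergence of $\psi(\lambda)$, then, since any derivative of $\psi$ is analytic within the same circle, we may write formally

(2.07.2)  $\psi(T'_i) = \psi(\lambda_i I + I_1) = \psi(\lambda_i)I + \psi'(\lambda_i)I_1 + \tfrac{1}{2}\psi''(\lambda_i)I_1^2 + \cdots,$

where at most $\nu$ terms are non-null. Hence for a matrix of the form of $T'_i$ the analytic function $\psi(T'_i)$ is defined by this relation. But then if every $\lambda_i$ lies within the circle of convergence, we may take

(2.07.3)  $\psi(T') = \begin{pmatrix} \psi(T'_1) & 0 & \cdots \\ 0 & \psi(T'_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix},$

and then we may further take

(2.07.4)  $\psi(T) = F\psi(T')F^{-1}.$

Hence if $\psi(\lambda)$ is any function that is analytic in a circle about the origin which contains all proper values of $T$ in its interior, then $\psi(T)$ is defined and, in fact, is given by the convergent power series

(2.07.5)  $\psi(T) = \psi(0)I + \psi'(0)T + \tfrac{1}{2}\psi''(0)T^2 + \cdots.$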
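
The agreement between the finite expression (2.07.2) and the power series (2.07.5) can be illustrated numerically. In the sketch below the function is assumed to be $\psi(\lambda) = e^\lambda$, which is analytic everywhere, and the matrix is assumed to be a single block $\lambda I + I_1$ of order 3 with $\lambda = 0.5$; these choices and the function names are made here for illustration only.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def block(lam, order):
    """lam*I + I_1: lam on the diagonal, units on the first subdiagonal."""
    return [[lam if i == j else (1.0 if i == j + 1 else 0.0)
             for j in range(order)] for i in range(order)]

def exp_by_series(T, terms=40):
    """Partial sum of (2.07.5) for psi = exp: I + T + T^2/2! + ..."""
    n = len(T)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]
    for k in range(1, terms):
        # term now holds T^k / k!
        term = [[v / k for v in row] for row in matmul(term, T)]
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
    return total

def exp_by_derivatives(lam, order):
    """Finite sum (2.07.2) for psi = exp.

    I_1^k has units on the k-th subdiagonal, so the element in row i and
    column j (i >= j) is psi^(k)(lam)/k! with k = i - j.
    """
    e = math.exp(lam)
    return [[e / math.factorial(i - j) if i >= j else 0.0
             for j in range(order)] for i in range(order)]

series = exp_by_series(block(0.5, 3))
closed = exp_by_derivatives(0.5, 3)
difference = max(abs(series[i][j] - closed[i][j])
                 for i in range(3) for j in range(3))
print(difference)
```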

2.08. Measures of Magnitude. Most iterative processes with matrices 
are equivalent to successive multiplications of a matrix by another matrix 
or by a vector, and hence involve the formation of successively higher 
powers of the first matrix. The success of the process depends upon the 
successive powers approaching the null matrix. An adequate criterion 
for this is given in terms of the proper values of the matrix, but the calcu- 
lation of proper values is generally a most laborious procedure. Hence 
other criteria that can be more readily applied are much to be desired. 




For this purpose certain measures of magnitude of a vector or matrix 
are now introduced, and some relations among them developed. These 
measures are of use also as an aid in assessing the several types of error 
that enter any numerical computation. 

For any matrix A we define the bound b(A), the norm N(A), and the 
maximum M(A), and for a vector x we define the bound b(x) and the 
norm N(x). It turns out that a natural definition of a maximum M (x) 
for a vector x is identical with N(x). 

Taking first the vector $x$ whose elements are $\xi_i$, we define

(2.08.1)  $b(x) = \max_i |\xi_i|, \qquad N(x) = \left[\sum \xi_i^2\right]^{1/2} = (x^T x)^{1/2}.$

Thus $b(x)$ is the magnitude of the numerically largest element, while $N(x)$ is the ordinary geometric length defined by the metric $I$. We define $b(A)$ and $N(A)$ analogously, $\alpha_{ij}$ being the elements of $A$:

(2.08.2)  $b(A) = \max_{i,j} |\alpha_{ij}|, \qquad N(A) = \left[\sum \alpha_{ij}^2\right]^{1/2}.$

Clearly

(2.08.3)  $b(A^T) = b(A), \qquad N(A^T) = N(A).$

If we use the notion of a trace of a matrix

(2.08.4)  $\mathrm{tr}\,(A) = \sum \alpha_{ii},$

then an equivalent expression for $N(A)$ is

(2.08.5)  $N(A) = [\mathrm{tr}\,(A^T A)]^{1/2} = [\mathrm{tr}\,(A A^T)]^{1/2}.$

If $a_i$ are the column vectors of $A$, and $a'_i$ the row vectors, then

(2.08.6)  $N^2(A) = \sum N^2(a_i) = \sum N^2(a'_i),$

where the exponent applies to the functional value

$N^2(A) = [N(A)]^2.$

Hence if $a_1 = x$ and all other $a_i = 0$,

$N(A) = N(x).$

A useful inequality is the Schwarz inequality which states that for any vectors $x$ and $y$

(2.08.7)  $|x^T y| = |y^T x| \le N(x)N(y).$

Geometrically this means that a scalar product of two vectors does not exceed the product of the lengths of the vectors (in fact it is this product multiplied by the cosine of the included angle). This generalizes immediately to matrices

(2.08.8)  $N(AB) \le N(A)N(B).$

Another useful inequality is the triangular inequality

(2.08.9)  $N(x + y) \le N(x) + N(y),$

which says that one side of a triangle does not exceed the sum of the other two, and which also generalizes immediately

(2.08.10)  $N(A + B) \le N(A) + N(B).$

Also we have

(2.08.11)  $b(A + B) \le b(A) + b(B).$

But

(2.08.12)  $|x^T y| \le n\,b(x)\,b(y),$

since in $x^T y$ there are $n$ terms each of which could have the maximum value $b(x)b(y)$. Hence for matrices

(2.08.13)  $b(AB) \le n\,b(A)\,b(B).$

If in (2.08.8) we take for the columns $b_i$ of $B$

$b_1 = x, \qquad b_2 = b_3 = \cdots = b_n = 0,$

then we have

(2.08.14)  $N(Ax) \le N(A)N(x).$
We now introduce the third measure:

(2.08.15)  $M(A) = \max_{x \neq 0} N(Ax)/N(x) = \max_{x,y \neq 0} |x^T A y|/[N(x)N(y)],$

or equivalently

(2.08.16)  $M(A) = \max_{N(x) = 1} N(Ax) = \max_{N(x) = N(y) = 1} |x^T A y|.$

It is obvious that (2.08.15) and (2.08.16) are equivalent. We show now that the two parts of (2.08.15) or of (2.08.16) are equivalent. Take $M(A)$ as defined by the first equality, and designate, for the moment, the last member by $M'(A)$. We wish to show that

$M(A) = M'(A).$

First we have

$|x^T A y| \le N(x)N(Ay)$

by the Schwarz inequality (2.08.7). But

$N(Ay) \le M(A)N(y)$

by definition of $M(A)$. Hence for any $x$ and $y$

$|x^T A y| \le M(A)N(x)N(y),$

$|x^T A y|/[N(x)N(y)] \le M(A).$

Hence the maximum of the left member cannot exceed the right member, so that

$M'(A) \le M(A).$

Now to prove that

$M(A) \le M'(A),$

we take

$x = Ay$

and obtain

$|x^T A y| = |y^T A^T A y| = N^2(Ay).$

Hence when $x$ is so defined

$\dfrac{|x^T A y|}{N(x)N(y)} = \dfrac{N^2(Ay)}{N(Ay)N(y)} = \dfrac{N(Ay)}{N(y)},$

and this quantity has $M(A)$ for its maximum. Hence $M'(A)$ cannot be less than $M(A)$, and the theorem is proved.
Since

$x^T A y = y^T A^T x,$

it follows from the second definition of $M(A)$ that

(2.08.17)  $M(A^T) = M(A).$

Also

(2.08.18)  $M(A + B) \le M(A) + M(B),$

(2.08.19)  $M(AB) \le M(A)M(B).$

To prove the first of these, we have

$N(Ax + Bx) \le N(Ax) + N(Bx) \le [M(A) + M(B)]N(x),$

the first inequality being a consequence of the triangular inequality applied to the vectors $Ax$ and $Bx$, and the second following from the definition of the maximum. Since, therefore,

$N[(A + B)x]/N(x) \le M(A) + M(B)$

for all vectors $x \neq 0$, the maximum value of the left member cannot exceed the right.

For the second inequality we have

$N[A(Bx)] \le M(A)N(Bx) \le M(A)M(B)N(x),$

both inequalities being consequences of the definition of the maximum. If we divide again by $N(x)$, the conclusion follows as before.

We now establish the following relations among the functions $N$, $b$, and $M$ of the same matrix:

(2.08.20)  $b(A) \le M(A) \le N(A) \le n\,b(A),$

(2.08.21)  $N(A) \le n^{1/2}M(A).$

To prove the first, we use the second definition of $M$ in (2.08.16). If we take

$x = e_i, \qquad y = e_j,$

then

$x^T A y = \alpha_{ij}.$

Any choice of unit vectors for $x$ and $y$ will give a number $x^T A y$ which cannot exceed the maximum in magnitude. Hence

$|\alpha_{ij}| \le M(A),$

and since this holds for any $i$ and $j$, it follows that

$b(A) \le M(A).$

Next, since by (2.08.14)

$N(Ax)/N(x) \le N(A),$

the maximum value of the left member cannot exceed $N(A)$, and this maximum is $M(A)$ by definition. Hence

$M(A) \le N(A).$

To show that

$N(A) \le n\,b(A),$

we go to the definition and write

$N^2(A) = \sum \alpha_{ij}^2 \le n^2 b^2(A),$

since $b^2(A)$ is the greatest of the $n^2$ terms in the sum. When we take the square root, we have the desired result.

Before proving (2.08.21), we prove first a more general result:

(2.08.22)  $N(AB) \le N(A)M(B), \qquad N(AB) \le M(A)N(B).$

If the columns of $B$ are $b_i$, then the columns of $AB$ are $Ab_i$. By (2.08.6) applied to the matrices $B$ and $AB$,

$N^2(AB) = \sum N^2(Ab_i) \le M^2(A)\sum N^2(b_i) = M^2(A)N^2(B).$

We get the second relation on taking square roots. The first follows after taking transposes from (2.08.3) and (2.08.17). Now (2.08.21) is an immediate consequence of the second of (2.08.22) when we take $B = I$, since

$N(I) = n^{1/2}.$

Of the three functions $b$, $N$, and $M$, the first is obtainable for any given matrix by inspection, and the second by direct computation. The third, however, is only obtainable in general from rather elaborate computations, though it generally yields the best estimates of error.

A few additional properties of these functions will be useful. First we have

(2.08.23)  $M(A^T A) = M^2(A).$

In view of (2.08.17) and (2.08.19), we know that the left member cannot exceed the right. On the other hand

$N^2(Ax) = x^T A^T A x \le N(x)N(A^T A x) \le M(A^T A)N^2(x),$

the first inequality being a consequence of the Schwarz inequality (2.08.7), while the second follows from the definition of $M$. Hence the right member of (2.08.23) cannot exceed the left, and the equality therefore follows.

An orthogonal matrix $X$ is one whose transpose is its inverse:

$X^T X = X X^T = I.$

Multiplication by an orthogonal matrix does not affect the norm of a vector:

$N^2(Xx) = x^T X^T X x = x^T x = N^2(x).$

Hence for any matrix $A$ if $X$ is orthogonal,

(2.08.24)  $N(AX) = N(A), \qquad M(AX) = M(A).$

By (2.06.25), for any symmetric matrix $B$ there exists an orthogonal matrix $X$ such that

(2.08.25)  $X^T B X = \Lambda$

is a diagonal matrix. The columns of $X$ are the proper vectors of the matrix, and the diagonal elements of $\Lambda$ the proper values. If

$B = A^T A,$

then $B$ is non-negative, and these proper values are all non-negative. We may suppose that they are arranged in order of decreasing magnitude,

(2.08.26)  $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0,$

the last equality holding only when $A$ is singular. Clearly

$M(B) = \lambda_1,$

and furthermore, by (2.08.23),

$M(B) = M^2(A).$

Hence

(2.08.27)  $M(A) = \lambda_1^{1/2}.$

Also

(2.08.28)  $N^2(A) = \sum \lambda_i.$

To see this, we observe that by definition the proper values $\lambda_i$ of a matrix $B$ satisfy the algebraic equation

$|B - \lambda I| = 0$

and that the trace $\mathrm{tr}\,(B)$ is the sum of the proper values, while by definition

$N^2(A) = \mathrm{tr}\,(A^T A) = \mathrm{tr}\,(B).$

These relations provide an alternative proof for (2.08.21) and the second inequality in (2.08.20).
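
The three measures can be compared numerically on a small matrix. The sketch below computes $b(A)$ and $N(A)$ directly and estimates $M(A)$ through (2.08.27). The estimate uses repeated multiplication by $A^T A$ (the power method), which is a choice made here for illustration and not a procedure prescribed by the text; it assumes that the starting vector is not orthogonal to the proper vector belonging to $\lambda_1$. The function names and the sample matrix are likewise assumptions.

```python
import math

def bound(A):
    """b(A): magnitude of the numerically largest element."""
    return max(abs(v) for row in A for v in row)

def norm(A):
    """N(A): square root of the sum of squares of all elements."""
    return math.sqrt(sum(v * v for row in A for v in row))

def vec_norm(x):
    """N(x): ordinary geometric length of a vector."""
    return math.sqrt(sum(v * v for v in x))

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def maximum(A, iterations=200):
    """Estimate M(A) = lambda_1^(1/2), lambda_1 the largest proper value
    of A^T A, by repeated multiplication of a unit vector by A^T A.
    Since x stays a unit vector, N(Ax) never exceeds M(A)."""
    At = transpose(A)
    x = [1.0] * len(A[0])
    x = [v / vec_norm(x) for v in x]
    for _ in range(iterations):
        z = matvec(At, matvec(A, x))
        length = vec_norm(z)
        if length == 0.0:
            return 0.0
        x = [v / length for v in z]
    return vec_norm(matvec(A, x))

A = [[1.0, 2.0], [3.0, 4.0]]    # illustrative matrix (an assumption)
n = len(A)
b, N, M = bound(A), norm(A), maximum(A)
print(b, M, N, n * b)           # ordered as in (2.08.20)
print(N, math.sqrt(n) * M)      # ordered as in (2.08.21)
```

For this matrix $M(A)$ and $N(A)$ nearly coincide, since the second proper value of $A^T A$ is small compared with the first.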

By analogy with $M$, we define

(2.08.29)  $m(A) = \min_{x \neq 0} N(Ax)/N(x).$

Then

(2.08.30)  $m(A) = \lambda_n^{1/2}.$

Also if $A$ is nonsingular,

(2.08.31)  $M(A^{-1}) = \lambda_n^{-1/2}, \qquad m(A^{-1}) = \lambda_1^{-1/2}.$

Hence for a nonsingular matrix

(2.08.32)  $M(A^{-1})\,m(A) = 1.$

These relations arise from the fact that, if $B$ is nonsingular,

$X^T B^{-1} X = \Lambda^{-1},$

which is a special case of the relation

(2.08.33)  $X^T B^r X = \Lambda^r,$

where $r$ is any integer, positive or negative.

We conclude this discussion by noting that, if $x'$ is the vector whose elements are $|\xi_i|$ and if $\mathbf{1}$ is the vector each of whose elements is unity, then from (2.08.7) it follows that

(2.08.34)  $\sum |\xi_i| \le n^{1/2} N(x),$

where $x$ has the elements $\xi_i$. This follows from the fact that

$\sum |\xi_i| = \mathbf{1}^T x', \qquad N(x') = N(x), \qquad N(\mathbf{1}) = n^{1/2}.$



2.1. Iterative Methods. Generally speaking, an iterative method for solving an equation or set of equations is a rule for operating upon an approximate solution $x_p$ in order to obtain an improved solution $x_{p+1}$, and such that the sequence $\{x_p\}$ so defined has the solution $x$ as its limit. This is to be contrasted with a direct method which prescribes only a finite sequence of operations whose completion yields an exact solution. Since the exact operations must generally be replaced by pseudo operations, in which round-off errors enter, the exact solution is seldom attainable in practice, and one may wish to improve the result actually obtained by one or more iterations. Also since the "approximation" $x_0$ with which one may start an iteration does not necessarily need to be close, it is sometimes advantageous to omit the direct method altogether, start with an arbitrary $x_0$, perhaps $x_0 = 0$, and iterate until the approach is sufficiently close.

2.11. Some Geometric Considerations. A large class of iterative methods are based upon the following simple geometric notion: Take any vector $b_0$ and a sequence of vectors $\{u_p\}$, and define the sequence $\{b_p\}$ by

(2.11.1)  $b_{p-1} = b_p + \lambda_p u_p,$

where the scalar $\lambda_p$ is chosen so that $b_p$ is orthogonal to $u_p$. Then if the vectors $u_p$ "fill out" some $n$ space, the vectors $b_p$ approach as a limit a vector that is orthogonal to this $n$ space. Without attempting a more precise definition of what is meant by "filling out," we can see that it must imply the following: If the vectors $\mathbf{e}_i$ represent any set of reference vectors for this $n$ space, then however far out we may go in the sequence $\{u_p\}$, it must always be possible to find vectors $u_q$ with a nonvanishing projection on any $\mathbf{e}_i$ and, in fact, with components that have some fixed positive lower bound. A possible choice for the vectors $u_p$ would be the reference vectors $\mathbf{e}_i$ taken in order and then repeated. If $e_i$ is the arithmetic vector associated with the geometric vector $\mathbf{e}_i$, we have then

$u_{\nu n + i} = e_i,$

where $\nu$ is any integer. It is easily verified that the vector $e_i$ has 1 in the $i$th place and 0 in every other position.

The vectors $\mathbf{e}_i$ are entirely subject to our choice, and we may choose any convenient positive definite matrix $H$ to represent the metric. Then the orthogonal projection upon $u_p$ is represented by $u_p(u_p^T H u_p)^{-1}u_p^T H$ (cf. 2.051), whence

(2.11.2)  $\lambda_p = u_p^T H b_{p-1}/(u_p^T H u_p).$

Now we write

(2.11.3)  $Ax = y$

as the equations to be solved. Let $x_p$ represent any approximation to the solution $x$. If

(2.11.4)  $As_p = r_p = y - Ax_p,$

then either $s_p$ or $r_p$ can be taken as representing the deviation of the approximation from the true solution. Hence either $s_p$ or $r_p$ may be taken as $b_p$. We consider separately the case when $A$ is itself positive definite and that when $A$ is not itself positive definite.
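2.111. The matrix A is positive definite. Let

$r_p = b_p.$

Then

$y - Ax_p = b_p = b_{p-1} - \lambda_p u_p = y - Ax_{p-1} - \lambda_p u_p.$

Hence by comparing first and last members in the equality

$Ax_p = Ax_{p-1} + \lambda_p u_p,$

$x_p = x_{p-1} + \lambda_p A^{-1}u_p.$

Therefore we take

$u_p = Av_p$

and obtain

(2.111.1)  $x_p = x_{p-1} + \lambda_p v_p.$

Now

$\lambda_p = u_p^T H r_{p-1}/(u_p^T H u_p) = v_p^T A H r_{p-1}/(v_p^T A H A v_p).$

Consequently it is natural to take

$H = A^{-1},$

which gives

(2.111.2)  $\lambda_p = v_p^T r_{p-1}/(v_p^T A v_p).$

Alternatively we may take

$s_p = b_p.$

Then

$y - Ax_p = Ab_p = A(b_{p-1} - \lambda_p u_p) = y - Ax_{p-1} - \lambda_p Au_p.$

Hence

$Ax_p = Ax_{p-1} + \lambda_p Au_p$

or

(2.111.3)  $x_p = x_{p-1} + \lambda_p u_p.$

But

$\lambda_p = u_p^T H s_{p-1}/(u_p^T H u_p).$

If we take

$H = A,$

we have

(2.111.4)  $\lambda_p = u_p^T r_{p-1}/(u_p^T A u_p).$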

Equations (2.111.3) and (2.111.4) are identical with (2.111.1) and (2.111.2) except for the replacement of $v_p$ by $u_p$.

In geometric terms, since $A$ is taken to represent the metric, our equation (2.11.3) requires the contravariant representation $x$ of a vector whose covariant representation $y$ is known. We begin with an approximate representation $x_0$. Then

$r_0 = y - Ax_0 = A(x - x_0)$

is the covariant representation of the error $x - x_0$, and

$\lambda_1 u_1 = u_1(u_1^T A u_1)^{-1}u_1^T r_0$

is the contravariant representation of the orthogonal component of the error in the direction of $u_1$. When this vector is added to $x_0$, the new error $s_1 = x - x_1$ is thus a leg of a right triangle of which $s_0 = x - x_0$ is the hypotenuse. Hence we have a better approximation provided only $u_1$ and $r_0$ were not orthogonal, that is, provided $u_1^T r_0 \neq 0$.

Any rule for selecting the $u_p$ (or equivalently the $v_p$) at each step defines a particular iterative process. There are three in common use:

1. The method of steepest descent prescribes that we take

$u_p = r_{p-1}.$

To understand the reason for this selection, we note first that, since $A$ is positive definite by hypothesis, there exists a matrix $C$ for which

$A = C^T C.$

The equations

$Ax = y$

are therefore equivalent to the equations

$Cx = z,$

where

$C^T z = y.$

Define the function

$g(x) = (Cx - z)^T(Cx - z) = x^T A x - 2x^T y + z^T z.$

This function is a sum of squares which has a minimum of zero for

$x = C^{-1}z = A^{-1}y.$

The function $g(x)$ is a function of the $n$ variables $\xi_1, \ldots, \xi_n$. Its partial derivatives with respect to these variables are the coordinates of the gradient vector

$g_x = 2(Ax - y),$

and the gradient at the point $x_{p-1}$ is $-2r_{p-1}$. Hence the function $g(x)$ evaluated at the point $x_{p-1}$ is undergoing its most rapid variation in the direction $r_{p-1}$, and if we think of our problem as one of minimizing $g(x)$ by proceeding in successive steps, it is natural to take each step in the direction of most rapid decrease. If we define

$\phi(\lambda) = g(x_{p-1} + \lambda u_p),$

we shall find that $\phi$, as a function of the single variable $\lambda$, takes on its minimum value at $\lambda = \lambda_p$ given by (2.111.4).
2. The method of Seidel takes

$u_{\nu n + i} = e_i.$

From (2.111.3) we see that $x_p$ differs from $x_{p-1}$ only in the $i$th element. Furthermore, since $b_p$ and $e_i$ are to be orthogonal, this means that the $i$th element of $r_p = As_p$ must equal zero, which means that the $i$th equation is satisfied exactly. Hence the $i$th element of $x_p$ is chosen so that the $i$th equation will be satisfied when all other elements are the same as for $x_{p-1}$. While we may expect that this process will require more steps than does the method of steepest descent, the simplicity of each step is a great advantage, especially in using automatic machinery.

3. The method of relaxation always takes $u_p$ to be some $e_i$, but the selection is made only at the time. Since the choice $u_p = e_i$ has the effect of eliminating the $i$th component of $r_p$, one chooses to eliminate the largest residual. However this is not necessarily the best choice. The effectiveness of the correction is measured by the magnitude of the correcting vector $\lambda_p u_p$, and this magnitude is

$|u_p^T A s_{p-1}|/(u_p^T A u_p)^{1/2}.$

Now when $u_p = e_i$, then $u_p^T A s_{p-1}$ is the $i$th component of the residual $r_{p-1}$, but this is divided by the length of $e_i$ which has the value of $\alpha_{ii}^{1/2}$. Hence one should examine the quotients of the residual components divided by the corresponding $\alpha_{ii}^{1/2}$ and eliminate the largest quotient.

This method clearly converges more rapidly than the Seidel method, which projects upon the same vectors but in a fixed sequence. Therefore for "hand" calculations it is to be preferred. For automatic machinery, however, the fixed sequence is almost certainly to be preferred.
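
The three rules differ only in the choice of $u_p$; the step itself is always (2.111.3) with $\lambda_p$ given by (2.111.4). The following sketch, offered only as an illustration under stated assumptions, implements that common step once and then the three selection rules. The function names, the starting vector $x_0 = 0$, the fixed step counts, and the sample positive definite system are all assumptions made here.

```python
import math

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(A, x, y):
    """r = y - A x."""
    return [yi - axi for yi, axi in zip(y, matvec(A, x))]

def project_step(A, x, y, u):
    """One step of (2.111.3) with lambda from (2.111.4)."""
    lam = dot(u, residual(A, x, y)) / dot(u, matvec(A, u))
    return [xi + lam * ui for xi, ui in zip(x, u)]

def unit(n, i):
    """e_i: 1 in the i-th place (counting from 0 here) and 0 elsewhere."""
    return [1.0 if k == i else 0.0 for k in range(n)]

def steepest_descent(A, y, steps):
    x = [0.0] * len(y)
    for _ in range(steps):
        r = residual(A, x, y)
        if dot(r, r) == 0.0:      # already exact; avoid dividing by zero
            break
        x = project_step(A, x, y, r)          # u_p = r_{p-1}
    return x

def seidel(A, y, sweeps):
    n = len(y)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):                    # e_i in strict rotation
            x = project_step(A, x, y, unit(n, i))
    return x

def relaxation(A, y, steps):
    n = len(y)
    x = [0.0] * n
    for _ in range(steps):
        r = residual(A, x, y)
        # largest quotient of residual component by alpha_ii^(1/2)
        i = max(range(n), key=lambda k: abs(r[k]) / math.sqrt(A[k][k]))
        x = project_step(A, x, y, unit(n, i))
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # positive definite
y = [6.0, 10.0, 8.0]                                      # so x = (1, 2, 3)
print(steepest_descent(A, y, 200))
print(seidel(A, y, 50))
print(relaxation(A, y, 300))
```

In practice one would stop when the residual is sufficiently small rather than after a fixed number of steps, and would update the residual incrementally instead of recomputing it; both simplifications are made here for brevity.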

2.112. The matrix A is not necessarily positive definite. This case can always be reduced to the preceding if we multiply throughout by $A^T$. However, this extra matrix multiplication is to be avoided if possible. With regard to the equations

$Ax = y,$

we may adopt either of two obvious geometric interpretations.

The simplest interpretation is that we wish to change the vector coordinates, as in Eq. (2.01.8), where $y$, taking the place of $x'$, is known, and $A$, taking the place of $E$, is known. In the symbols used here, therefore, the columns of $A$ are the numerical vectors which represent the $e_i$ in the system $f$, and the column vector $y$ represents $x$ in the same system. The vector $y$ is to be expressed as a linear combination of the columns of $A$, which is another way of saying that the vector $x$, whose representation is known in the $f$ system, is to be resolved along the vectors $e_i$.

The other interpretation comes from regarding each of the $n$ equations as the equation of a hyperplane in $n$ space. If $a_i^T$ represents the $i$th row vector in $A$, and $\eta_i$ the $i$th element of $y$, then the $i$th equation is

$a_i^T x = \eta_i,$

and this equation is satisfied by any vector $x$ leading from the origin of the point-coordinate system to a point in the hyperplane. If we divide through this equation by $N(a_i)$, the length of $a_i$, we obtain

$[a_i/N(a_i)]^T x = \eta_i/N(a_i),$

and since the vector multiplying $x$ is a unit vector, the equation says that the projection of $x$ upon the direction of $a_i$ is of length $\eta_i/N(a_i)$, and hence the same for all points in the plane. Consequently the vector $a_i$ is orthogonal to the plane, and the distance of the plane from the origin is

$|\eta_i|/N(a_i).$

In case we think of the underlying coordinate system $e_i$ as nonorthogonal, the vectors $a_i$ are taken to be covariant representations of the normals, and the vector $x$ as the contravariant representation of the vector $x$ drawn to the common intersection of the $n$ planes.

These two geometric interpretations suggest different iterative schemes. We begin with the hyperplane interpretation.

2.1121. The Equations Represent a System of Hyperplanes. If $v$ is any column vector, then

(2.1121.1)  $v^T A x = v^T y$

is also the equation of a hyperplane passing through the point $x$. The normal, written as a column vector, is $A^T v$. If $x_p$ is any approximation to $x$, and $s_p$ and $r_p$ are defined as before,

(2.1121.2)  $As_p = r_p = y - Ax_p = A(x - x_p),$

then we may take

(2.1121.3)  $b_p = s_p = x - x_p = A^{-1}r_p,$

and project upon the vector $u_{p+1} = A^T v_{p+1}$. This amounts to writing

$Ax_p = y - Ab_p,$

so that as $b_p$ vanishes, $x_p$ approaches $x$. The basic sequence as defined by (2.11.1) and (2.11.4) takes the form

(2.1121.4)  $b_{p-1} - b_p = \lambda_p A^T v_p,$

(2.1121.5)  $\lambda_p = v_p^T r_{p-1}/(v_p^T A A^T v_p),$

if the identity matrix is taken to define the metric. But then (2.1121.4) gives

(2.1121.6)  $x_p = x_{p-1} + \lambda_p A^T v_p.$

By analogy with the method of steepest descent as described for the positive definite case, we may define the non-negative function

$g(x) = (Ax - y)^T(Ax - y)$

and find that at $x_{p-1}$ its gradient lies in the direction $A^T r_{p-1}$, whence the choice

$v_p = r_{p-1}$

is optimal from the point of view of most rapid minimization of $g(x)$. If we take

$v_{\nu n + i} = e_i,$

then, as appears from (2.1121.4), we are projecting the vectors $b_p$ in rotation upon the normals to the basic hyperplanes of the system. In this event $x_p$ is caused to lie in the $i$th hyperplane so that the $i$th equation is satisfied, though in general all components of $x_p$ will differ from those of $x_{p-1}$. The numerator in $\lambda_p$ is simply the $i$th residual, i.e., the $i$th element of $r_{p-1}$, while the denominator is the sum of squares of the $i$th row of $A$.

Hence if we follow the lead of the method of relaxation, we would choose $v_p$ to be that $e_i$ that would provide the maximal correction

$N(\lambda_p A^T e_i) = |e_i^T r_{p-1}|/(e_i^T A A^T e_i)^{1/2}.$

Hence to make the optimal choice, we should divide each residual by the square root of the sum of the squares of the corresponding row of $A$, and select the largest quotient. Presumably all these square roots will be used repeatedly and should be calculated in advance.
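
The projection in rotation upon the normals to the hyperplanes, (2.1121.5) and (2.1121.6) with $v_p = e_i$, can be sketched as follows. This is an illustration under stated assumptions rather than a recommended program: the function name `row_projection`, the starting vector $x_0 = 0$, the fixed number of sweeps, and the small nonsymmetric system are all chosen here.

```python
def row_projection(A, y, sweeps):
    """Project in rotation upon the normals of the hyperplanes.

    With v_p = e_i, (2.1121.5) gives lambda_p as the i-th residual
    divided by the sum of squares of the i-th row of A, and (2.1121.6)
    adds lambda_p times that row to the current approximation.
    """
    n = len(A)
    x = [0.0] * len(A[0])
    # sums of squares of the rows, calculated once in advance
    row_sq = [sum(a * a for a in row) for row in A]
    for _ in range(sweeps):
        for i in range(n):
            r_i = y[i] - sum(a * xj for a, xj in zip(A[i], x))
            lam = r_i / row_sq[i]
            x = [xj + lam * a for xj, a in zip(x, A[i])]
    return x

A = [[2.0, 1.0], [-1.0, 3.0]]   # nonsymmetric, nonsingular (an assumption)
y = [4.0, 5.0]                  # chosen so that x = (1, 2)
x = row_projection(A, y, 50)
print(x)
```

After each step the equation just treated is satisfied exactly, but, as remarked above, every component of the approximation is in general altered.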

2.1122. The Equations Represent a Resolution of the Vector y along the Column Vectors of A. If $x_p$ is any set of trial multipliers,

$r_p = y - Ax_p$

represents the deviation of the vector $Ax_p$ from the required vector $y$. Take

$b_p = r_p,$

and let

$u_p = Av_p$

represent any linear combination of the columns of $A$. Then Eqs. (2.11.1) and (2.11.4) give

$y - Ax_{p-1} = y - Ax_p + \lambda_p Av_p,$

or

(2.1122.1)  $x_p = x_{p-1} + \lambda_p v_p,$

with

(2.1122.2)  $\lambda_p = v_p^T A^T r_{p-1}/(v_p^T A^T A v_p).$

The method of steepest descent would take

$v_p = A^T r_{p-1},$

a choice that complicates the denominator in $\lambda_p$ excessively. In taking $v_p = e_i$ for some $i$, we alter only one element of $x_{p-1}$ in obtaining $x_p$, but no element of $r_p$ is made to vanish, so no one of the equations is necessarily satisfied exactly. To find the optimal $e_i$ according to the principle of the method of relaxation, we observe that we wish to maximize the length of the vector

$\lambda_p Ae_i,$

or

$|e_i^T A^T r_{p-1}|/(e_i^T A^T A e_i)^{1/2}.$

However, the numerators of the possible $\lambda_p$ have the form

$e_i^T A^T r_{p-1},$

which is a complete scalar product of the $i$th column of $A$ with the residual vector $r_{p-1}$. Taking this scalar product represents the greater portion of the labor involved in the complete projection, so that one would probably always take the vectors $e_i$ in strict rotation.
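
The resolution along the columns, (2.1122.1) and (2.1122.2) with $v_p = e_i$ taken in strict rotation, can be sketched in the same manner. The function name `column_projection`, the starting multipliers $x_0 = 0$, the fixed number of sweeps, and the sample system are assumptions made here for illustration.

```python
def column_projection(A, y, sweeps):
    """Resolve y along the columns of A, taking the columns in rotation.

    With v_p = e_i, (2.1122.2) gives lambda_p as the scalar product of
    the i-th column with the residual, divided by the sum of squares of
    that column; only the i-th multiplier is altered by (2.1122.1).
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    r = list(y)                      # residual y - A x for x = 0
    col_sq = [sum(A[k][i] ** 2 for k in range(n_rows))
              for i in range(n_cols)]
    for _ in range(sweeps):
        for i in range(n_cols):
            lam = sum(A[k][i] * r[k] for k in range(n_rows)) / col_sq[i]
            x[i] += lam
            # keep the residual current: r <- r - lambda * (i-th column)
            r = [r[k] - lam * A[k][i] for k in range(n_rows)]
    return x, r

A = [[2.0, 1.0], [-1.0, 3.0]]   # nonsymmetric, nonsingular (an assumption)
y = [4.0, 5.0]                  # chosen so that x = (1, 2)
x, r = column_projection(A, y, 50)
print(x, r)
```

After each step the residual is orthogonal to the column just used, although no single equation need be satisfied exactly.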

2.113. Some generalizations. The methods described in 2.111 consisted 
in taking each residual $s_{p-1} = x - x_{p-1}$ from the true solution $x$, 
projecting orthogonally upon a vector $u_p$, and adding the projection to 
$x_{p-1}$ to obtain an improved approximation $x_p$. The new residual $s_p$ 
was orthogonal to the projection on $u_p$. Clearly if the projection is 
made on a linear space of two or more dimensions, the projection, i.e., the 
correction, will be at least as large as the projection on any single direction 
in this space. Hence it is to be expected that the rate of convergence 
of the process would be more rapid if, instead of projecting each time 
upon a single vector $u_p$, we were to project upon a linear space of two or 
more dimensions. Such a space may be represented by a matrix $U_p$ 
such that any vector in the space is a linear combination of columns of 
$U_p$. 

The problem is now the following: Given any matrix $U_p$, we wish to 
project the residual $s_{p-1}$ orthogonally upon the space $U_p$ (that is, the 
space of linear combinations of its columns). The projection will be 
taken as a correction to be added to $x_{p-1}$ to yield the improved 
approximation $x_p$. The orthogonal projection is represented by the matrix 
$U_p (U_p^T A U_p)^{-1} U_p^T A$, and we find now that 

(2.113.1)  $x_p = x_{p-1} + U_p (U_p^T A U_p)^{-1} U_p^T r_{p-1}$. 

The scalar $\lambda_p$ is for present purposes to be replaced by the vector 

$(U_p^T A U_p)^{-1} U_p^T r_{p-1}$. 

Its expression is the same as for $\lambda_p$ except that the matrix $U_p$ replaces 
the vector $u_p$, and it is to be noted that the reciprocal matrix enters as a 
premultiplier. While the method does provide a larger correction, in 
general this advantage is offset by the necessity for calculating an inverse 
matrix whose order is equal to the dimensionality of the subspace upon 
which the projection is being made. Nevertheless in special cases this 
inversion may prove to be fairly simple. 

If the columns of the matrix $U_p$ are unit vectors $e_i$, then the matrix 
$U_p^T A U_p$ is a principal submatrix of the matrix A. If, say, we take $U_p$ 
to be the two-column matrix $(e_i, e_j)$, then 

$U_p^T A U_p = \begin{pmatrix} \alpha_{ii} & \alpha_{ij} \\ \alpha_{ji} & \alpha_{jj} \end{pmatrix}$. 

The correction is made so as to eliminate both the $i$th and the $j$th elements 
of $r_p$ by appropriate selection of the $i$th and $j$th elements of $x_p$. Hence 
the $i$th and $j$th equations are solved simultaneously for these elements in 
terms of the current approximations to the other elements. This is to 
be contrasted with the method of 2.111 where the $i$th and $j$th equations 
are solved at different times and not simultaneously. 

The iterations for the case when A is not a symmetric matrix can be 
generalized in like manner. When we interpret the equations as repre- 
senting a system of hyperplanes and project upon a space determined by 
normals to these hyperplanes, we obtain 

(2.113.2)  $x_p = x_{p-1} + A^T V_p (V_p^T A A^T V_p)^{-1} V_p^T r_{p-1}$, 

where $V_p$ is an arbitrary matrix of linearly independent columns. When 
we interpret the equations as requiring the resolution of the vector y 
along the column vectors of A, we get 

(2.113.3)  $x_p = x_{p-1} + V_p (V_p^T A^T A V_p)^{-1} V_p^T A^T r_{p-1}$. 

2.12. Some Analytic Considerations. The iterative methods so far 
discussed have been suggested by geometry. Analytical considerations 
suggest several others. 

2.121. Cesari's method. In the Seidel iteration $x_1$ differs from $x_0$ only 
in the first element, which is so chosen that the first equation is satisfied 
exactly. Next $x_2$ differs from $x_1$ only in the second element, which is so 
chosen that the second equation is satisfied exactly. In the $n$th step, 
$x_n$ differs from $x_{n-1}$ only in the $n$th element, which is chosen to satisfy 
the $n$th equation exactly. Next $x_{n+1}$ is obtained from $x_n$ by readjusting 
the first element to satisfy the first equation, and this begins a new cycle. 

Let us write $A = A_1 + A_2$, where in $A_1$ all diagonal and subdiagonal 
elements are the same as those in A, while all supradiagonal elements are 
zero: 

(2.121.1)  $A_1 = (\alpha'_{ij})$,  $\alpha'_{ij} = \alpha_{ij}$ when $i \ge j$,  $\alpha'_{ij} = 0$ when $i < j$; 
           $A_2 = (\alpha''_{ij})$,  $\alpha''_{ij} = 0$ when $i \ge j$,  $\alpha''_{ij} = \alpha_{ij}$ when $i < j$. 

Then we verify easily that 

$A_1 x_{(p+1)n} = y - A_2 x_{pn}$. 

Since the vectors $x_{pn+i}$ for $0 < i < n$ need not enter explicitly, we may 
modify the notation by writing simply $x_p$ for what had been designated 
$x_{pn}$, and the iteration is written 

(2.121.2)  $A_1 x_{p+1} = y - A_2 x_p$,  $A = A_1 + A_2$. 
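
As a hedged illustration of the recursion (2.121.2), the following sketch carries out one Seidel cycle in two ways: element by element, and by forming $y - A_2 x_p$ and solving the triangular system with $A_1$ by forward substitution. The function names and the small test system are illustrative assumptions and do not come from the text.

```python
def seidel_cycle(A, y, x):
    """One full cycle of the Seidel iteration, adjusting each element in turn."""
    n = len(A)
    x_new = list(x)
    for i in range(n):
        # Elements before i already hold new values; elements after i are still old.
        s = sum(A[i][j] * x_new[j] for j in range(n) if j != i)
        x_new[i] = (y[i] - s) / A[i][i]
    return x_new


def splitting_cycle(A, y, x):
    """The same cycle written as A1 x_new = y - A2 x (A1 lower triangle, A2 strict upper)."""
    n = len(A)
    rhs = [y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n)) for i in range(n)]
    x_new = [0.0] * n
    for i in range(n):  # forward substitution with the triangular matrix A1
        s = sum(A[i][j] * x_new[j] for j in range(i))
        x_new[i] = (rhs[i] - s) / A[i][i]
    return x_new


# Illustrative positive definite system (an assumption); its solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
y = [1.0, 2.0]
x = [0.0, 0.0]
for _ in range(25):
    x = splitting_cycle(A, y, x)
```

Both functions produce the same vector from the same starting point; the second form makes the roles of $A_1$ and $A_2$ explicit.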



We know that this iteration converges when A is positive definite 
and $A_1$ and $A_2$ are given by (2.121.1). However (2.121.2) formally 
defines an iterative process whether or not A is positive definite, and 
whether or not $A_1$ and $A_2$ satisfy (2.121.1). It is required only that $A_1$ 
be nonsingular. Since (2.121.2) is equivalent to 

$x_p = A_1^{-1} y - A_1^{-1} A_2 x_{p-1}$ 
and x satisfies 

$x = A_1^{-1} y - A_1^{-1} A_2 x$, 

it follows that the residuals $s_p = x - x_p$ satisfy 

$s_p = -A_1^{-1} A_2 s_{p-1}$. 

Hence for the sequence (2.121.2) to converge for an arbitrary y, it is 
necessary and sufficient that the proper values of the matrix $A_1^{-1} A_2$ 
should all be of modulus < 1. The characteristic equation for this 
matrix can be written 

(2.121.3)  $|\lambda A_1 + A_2| = 0$. 

A sufficient condition for convergence is that 

(2.121.4)  $M(A_1^{-1} A_2) < 1$, 

and hence, a fortiori, it is sufficient that 

(2.121.5)  $N(A_1^{-1} A_2) < 1$, 

the latter criterion being one that is fairly readily applied. 
Cesari's derivation of the same sequence is interesting. Let 

(2.121.6)  $A_1 + A_2 = \gamma A$, 

where $\gamma$ is any scalar and $A_1$ is nonsingular. Let the vector $x(\mu)$ as a 
function of the scalar $\mu$ be defined by 



(2.121.7) Ui + MAOsGO = yy + (M 

with v an arbitrary vector. Then 

= x. 



Let 

*(/) = x, + uasi + MV '/2! +'.. 

Then on substituting into (2.121.7) and grouping terms, one obtains 
= (A 1*0 - 

If 
then 



1^0 - yy 

= yy - Azx p (p > 1), 



and we have again the recursion (2.121.2). 

When $A_1$ is a diagonal matrix whose diagonal is the same as that of 
$\gamma A$, we have a method discussed by von Mises and Geiringer. Since for 
such a choice of $A_1$ the diagonal elements of $A_2$ are all zeros, and since $A_1$ 
is a diagonal matrix, the diagonal elements of $A_1^{-1} A_2$ are also all zeros. 
It is no restriction to suppose that $A_1 = I$ and that $\gamma = 1$, for we may 
replace the original system by the equivalent one, 

$\gamma A_1^{-1} A x = \gamma A_1^{-1} y$. 

Then (2.121.5) yields one of the criteria given by von Mises and Geiringer 
which they state in the form 

$\sum_{i \ne j} \alpha_{ij}^2 < 1$. 

We obtain another of their criteria, 

$\sum_{i \ne j} |\alpha_{ij}| \le a < 1$, 

by noting that, if $\sigma_i^{(p)}$ are the elements of $s_p$, then 

$|\sigma_i^{(p+1)}| \le \sum_{j \ne i} |\alpha_{ij}| |\sigma_j^{(p)}|$, 

whence the criterion implies that $\sum_i |\sigma_i^{(p+1)}| \le a \sum_i |\sigma_i^{(p)}|$. 
Now suppose A is positive definite and write 

(2.121.8)  $\gamma(I + B) = A$. 

Thus we take $A_1 = I$, $A_2 = B$ for (2.121.2). Since 

$\lambda I - A = \gamma[(\lambda/\gamma - 1)I - B]$, 

it follows that for each proper value $\lambda_i$ of A there is a proper value 
$(\lambda_i - \gamma)/\gamma$ of B, and convergence of the process requires therefore 
that every $\lambda_i < 2\gamma$. If $\gamma$ is so chosen, and if the proper values of A are 
arranged in order of magnitude, 

$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, 
then the proper values of B, in the same order, are 

$\lambda_1/\gamma - 1, \lambda_2/\gamma - 1, \ldots, \lambda_n/\gamma - 1$, 

and at least $\lambda_n/\gamma - 1$ is negative. If $\gamma$ is taken too near to $\lambda_1/2$, then 
$\lambda_1/\gamma - 1$ will be close to 1, and convergence will be slow; if $\gamma$ is taken 
too large, $\lambda_n/\gamma - 1$ will be close to $-1$, and convergence again slow. 
The optimal choice is 

(2.121.9)  $\gamma = (\lambda_1 + \lambda_n)/2$, 
for then 

(2.121.10)  $\lambda_1/\gamma - 1 = -(\lambda_n/\gamma - 1) = (\lambda_1 - \lambda_n)/(\lambda_1 + \lambda_n)$, 

the extreme proper values being equal in magnitude. 
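
The effect of the choice of $\gamma$ can be tried numerically. With $A_1 = I$ and $A_2 = B = A/\gamma - I$ applied to the scaled system $(I + B)x = y/\gamma$, the recursion (2.121.2) reduces to $x_{p+1} = x_p + (y - Ax_p)/\gamma$. The sketch below is only an illustration under that reading; the matrix (with proper values 3 and 1), the right member, and the function names are assumptions chosen for the demonstration.

```python
def residual(A, y, x):
    n = len(A)
    return [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def scaled_iteration(A, y, gamma, steps):
    """Iterate x <- x + r/gamma, i.e. (2.121.2) with A1 = I and A2 = A/gamma - I."""
    x = [0.0] * len(A)
    for _ in range(steps):
        r = residual(A, y, x)
        x = [xi + ri / gamma for xi, ri in zip(x, r)]
    return x


def error(x, exact):
    return max(abs(a - b) for a, b in zip(x, exact))


# Illustrative matrix with proper values 3 and 1 (an assumption for the demo).
A = [[2.0, 1.0], [1.0, 2.0]]
y = [3.0, 1.0]
exact = [5.0 / 3.0, -1.0 / 3.0]

err_optimal = error(scaled_iteration(A, y, 2.0, 30), exact)     # gamma = (3 + 1)/2
err_near_half = error(scaled_iteration(A, y, 1.6, 30), exact)   # gamma near 3/2
err_too_large = error(scaled_iteration(A, y, 10.0, 30), exact)  # gamma much too large
```

After the same number of steps the error for $\gamma = 2$ is far smaller than for either of the other two choices, in line with (2.121.10).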

We may now ask, with Cesari, whether by any choice of a polynomial 
$f(\lambda)$, with $F(\lambda) = \lambda f(\lambda)$, the system 

$F(A)x = f(A)y$, 

equivalent to the original, yields a more rapidly convergent sequence. 

The proper values of $F(A)$ are $\mu_i = F(\lambda_i)$. If $\mu'$ and $\mu''$ are the largest 
and smallest of the $\mu_i$, we wish to choose $F(\lambda)$ so that $(\mu' - \mu'')/(\mu' + \mu'')$ 
is as small as possible, as we see by (2.121.10). Hence we wish to choose 
$F(\lambda)$ to be positive over the range $(\lambda_1, \lambda_n)$, and with the least possible 
variation. 

The simplest case is that of a quadratic function F, and the optimal 
choice is then 

F - X(a - X), a - (Xi + XO/2. 

Ordinarily one does not know the proper values in advance, though one 
might wish to estimate the two extreme ones required (e.g., see Bargmann, 
Montgomery, and von Neumann), or these might be required for other 
purposes. 

2.122. The method of Hotelling and Bodewig. The iterations so far 
considered have begun with an arbitrary initial approximation $x_0$ (which 
might be $x_0 = 0$). Suppose, now, that by some process of operating 
upon y in the system 

(2.122.1)  $Ax = y$, 

perhaps by means of one of the direct methods of solution to be described 
below, one obtains a "solution" $x_0$ which, however, is inexact because 
it is infected by round-off. The operations performed upon the vector y 
are equivalent to the multiplication by an approximate inverse C: 

(2.122.2)  $x_0 = Cy$. 

Then by (2.11.4) the unknown residual $s_0$ satisfies the same system as 
does x, except that $r_0$ replaces y, and hence we might expect that $Cr_0$ is 
also an approximation to $s_0$. Hence we might suppose that 

$x_1 = x_0 + Cr_0 = C(2I - AC)y$ 

is a closer approximation to x than is $x_0$. Otherwise stated, it would 
appear that, if $C_0$ is an approximation to $A^{-1}$, then 

$C_1 = C_0(2I - AC_0)$ 
is a better one. 
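
A single correction of this kind is easily tried. The sketch below forms $x_0 = Cy$ and then $x_1 = x_0 + Cr_0$ for a small system; the matrix, the two-decimal approximate inverse C, and the function names are illustrative assumptions only.

```python
def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]


def correct_once(A, C, y, x0):
    """Return x1 = x0 + C r0, where r0 = y - A x0."""
    r0 = [yi - ti for yi, ti in zip(y, matvec(A, x0))]
    return [xi + ci for xi, ci in zip(x0, matvec(C, r0))]


# Illustrative data (assumptions): C is the exact inverse of A rounded to two decimals.
A = [[4.0, 1.0], [1.0, 3.0]]
C = [[0.27, -0.09], [-0.09, 0.36]]
y = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]

x0 = matvec(C, y)
x1 = correct_once(A, C, y, x0)
err0 = max(abs(a - b) for a, b in zip(x0, exact))
err1 = max(abs(a - b) for a, b in zip(x1, exact))
```

For this particular C the product AC differs from I by 0.01 in each diagonal element, so the error of $x_1$ is about one hundredth of the error of $x_0$.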

If $C_0$ is an approximation to $A^{-1}$, then $AC_0$ is an approximation to I. 
Let 

(2.122.3)  $B_p = I - AC_p$,  $C_{p+1} = C_p(I + B_p)$. 

Then 

$B_{p+1} = I - AC_{p+1} = I - AC_p(I + B_p) = B_p^2$, 

and therefore 

(2.122.4)  $B_p = B_0^{2^p}$. 

Hence if $M(B_0) < 1$, then $M(B_{p+1}) < M(B_p)$, and if $N(B_0) < 1$, then 
$N(B_{p+1}) < N(B_p)$, and in fact, 

(2.122.5)  $M(B_p) \le [M(B_0)]^{2^p}$,  $N(B_p) \le [N(B_0)]^{2^p}$. 
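
The recursion (2.122.3) and the second bound in (2.122.5) can be checked on a small example. The sketch below is illustrative only: it takes N to be the square root of the sum of the squares of the elements, and the matrix, the crude starting approximation, and the function names are assumptions.

```python
import math


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def residual_matrix(A, C):
    """B = I - A C."""
    AC = matmul(A, C)
    n = len(A)
    return [[(1.0 if i == j else 0.0) - AC[i][j] for j in range(n)] for i in range(n)]


def improve(A, C):
    """One step of (2.122.3): C_next = C (I + B), with B = I - A C."""
    B = residual_matrix(A, C)
    n = len(A)
    I_plus_B = [[(1.0 if i == j else 0.0) + B[i][j] for j in range(n)] for i in range(n)]
    return matmul(C, I_plus_B)


def N(X):
    """Square root of the sum of squares of the elements of X."""
    return math.sqrt(sum(t * t for row in X for t in row))


A = [[4.0, 1.0], [1.0, 3.0]]
C = [[0.25, 0.0], [0.0, 0.3]]          # crude starting approximation (assumption)
norms = [N(residual_matrix(A, C))]     # norms[p] holds N(B_p)
for _ in range(5):
    C = improve(A, C)
    norms.append(N(residual_matrix(A, C)))
```

Each step roughly squares the size of the residual matrix, so a handful of steps suffices once $N(B_0)$ is below unity.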

If A is positive definite, then we can always transform the equations 
if necessary and secure that $M(A) < 1$. Then all proper values $\mu$ of 
$I - A$ satisfy $0 < \mu < 1$ so that $M(I - A) < 1$. In this case we may 
take 

$C_0 = I$,  $B_0 = I - A$, 
whence 

$B_p = (I - A)^{2^p}$, 

and convergence is assured. 

To return to the general case, if $C_0$ is any approximate inverse, so that 
$M(B_0)$ is small, we have 

$A^{-1} = C_0(I - B_0)^{-1} = C_0(I + B_0 + B_0^2 + \cdots)$, 
and the series converges. It is easily verified that 

$(I + B_0)(I + B_0^2) = I + B_0 + B_0^2 + B_0^3$, 
$(I + B_0)(I + B_0^2)(I + B_0^4) = I + B_0 + \cdots + B_0^7$, 

Hence $A^{-1}$ is expressible as the infinite product 

(2.122.6)  $A^{-1} = C_0(I + B_0)(I + B_0^2)(I + B_0^4) \cdots$. 

In applying this scheme, one retains the successively squared powers of 
$B_0$, adding I after a squaring, and multiplying by the previously obtained 
approximation. The identical argument carries through if we take 

$B_0' = I - C_0 A$ 

and obtain 

(2.122.7)  $A^{-1} = \cdots (I + B_0'^4)(I + B_0'^2)(I + B_0')C_0$. 

This scheme is given by Hotelling and by Bodewig. 

In solving a system of equations, it may be preferable to operate 
directly on the vector y and the successive remainders, as was originally 
suggested. In this case the matrix C may not be given explicitly. Let 

(2.122.8)  $v_0 = x_0 = Cy$,  $r_0 = y - Ax_0 = By$, 
           $v_{p+1} = Cr_p$,  $r_{p+1} = r_p - Av_{p+1} = Br_p$. 

Then 

(2.122.9)  $x_p = v_0 + v_1 + \cdots + v_p$, 
           $r_p = y - Ax_p = B^{p+1} y$. 

The procedure is to compute $v_{p+1}$ from the last remainder $r_p$, and then 
compute the next remainder $r_{p+1}$. 

2.13. Some Estimates of Error. The fact that an iterative process converges 
to a given limit does not of itself imply that the sequence obtained 
by a particular digital computation will approach this limit. If the 
machine operates with $\sigma$ significant figures in the base $\beta$, we are by no 
means sure of $\sigma$ significant figures in the result. At some stage the 
round-off errors introduced in the process being used will be of such 
magnitude that continuation of the process is unprofitable. However 
another, perhaps more slowly convergent, process might permit further 
improvement. In any case it is important to be able to estimate both 
the residual errors and the generated errors. In presenting these 
estimates, it will be supposed that the equations to be solved are themselves 
exact. The extent to which an error in the original coefficients 
affects the solution will be discussed in a later section. 

2.131. Residual errors. We first consider residual or truncation errors 
neglecting any effects of round-off. If $M(H) < 1$, then from the identity 

$(I - H)^{-1} = I + H + H^2 + \cdots = I + H(I + H + H^2 + \cdots)$ 
it follows that 

$M[(I - H)^{-1}] \le 1 + M(H)M[(I - H)^{-1}]$. 
Hence 

(2.131.1)  $M[(I - H)^{-1}] \le 1/[1 - M(H)]$. 

Thus in certain cases a bound for the maximum value of a reciprocal can 
be obtained from the matrix itself. From the same identity, since 

< l,then 



(2.131.2) 

and if $nb(H) < 1$, then 

(2.131.3)  $b[(I - H)^{-1}] \le 1/[1 - nb(H)]$. 



The last two inequalities are generally less sharp but more easily applied. 
Now consider the sequences $C_p$ and $B_p$ defined by (2.122.3). Since 

$A^{-1} = C_0(I - B_0)^{-1}$, 
if we set $H = B_0$, then 

(2.131.4)  $M(A^{-1}) \le M(C_0)/[1 - M(B_0)]$, 

provided the denominator is positive. To establish the analogous 
inequalities using N and b, we note that 

$A^{-1} = C_0 + C_0B_0 + C_0B_0^2 + \cdots$, 
whence 

(2.131.5)  $N(A^{-1}) \le N(C_0) + N(C_0)N(B_0) + \cdots = N(C_0)/[1 - N(B_0)]$ 
and 

(2.131.6)  $b(A^{-1}) \le b(C_0) + nb(C_0)b(B_0) + \cdots = b(C_0)/[1 - nb(B_0)]$, 

again provided the denominators are positive. 

Now from (2.122.4) and (2.122.3) we can deduce that 

$C_p = A^{-1}(I - B_p) = A^{-1}(I - B_0^{2^p})$, 
or 

(2.131.7)  $A^{-1} - C_p = A^{-1}B_0^{2^p}$. 

Hence given the hypotheses on $M(B_0)$, $N(B_0)$, or $nb(B_0)$, as the case may 
be, we have 

$M(A^{-1} - C_p) \le M(A^{-1})M^{2^p}(B_0)$, 
whence by (2.131.4) 

(2.131.8)  $M(A^{-1} - C_p) \le M(C_0)M^{2^p}(B_0)/[1 - M(B_0)]$, 
and analogously 

(2.131.9)  $N(A^{-1} - C_p) \le N(C_0)N^{2^p}(B_0)/[1 - N(B_0)]$, 

(2.131.10)  $b(A^{-1} - C_p) \le n^{2^p}b(C_0)b^{2^p}(B_0)/[1 - nb(B_0)]$. 



We can write (2.131.7) in the form 

$A^{-1} - C_p = C_0(I - B_0)^{-1}B_0^{2^p}$. 

If 

$x_p = C_p y$, 
then 

$s_p = x - x_p = C_0(I - B_0)^{-1}B_0^{2^p} y$. 
Hence we have, for example, 

$N(s_p) \le M(C_0)M^{2^p}(B_0)N(y)/[1 - M(B_0)]$. 

If $e_i$ is the $i$th unit vector, $e_i^T s_p$ is the $i$th element in $s_p$. Hence we may 
write, with de la Garza, 

(2.131.11)  $|e_i^T s_p| \le N(e_i^T C_0)M^{2^p}(B_0)N(y)/[1 - M(B_0)]$, 

(2.131.12) \e} 



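For the Seidel iteration we have 

$A = A_1 + A_2$, 

$A^{-1} = (I - H)^{-1}A_1^{-1}$,  $H = -A_1^{-1}A_2$. 
Let 

$C_p = (I + H + H^2 + \cdots + H^{p-1})A_1^{-1}$. 
Then 

$A^{-1} - C_p = H^p(I - H)^{-1}A_1^{-1}$ 
and 

(2.131.13)  $M(A^{-1} - C_p) \le M^p(H)M(A_1^{-1})/[1 - M(H)]$, 

(2.131.14)  $N(A^{-1} - C_p) \le N^p(H)N(A_1^{-1})/[1 - N(H)]$, 

(2.131.15)  $b(A^{-1} - C_p) \le n^{p+1}b^p(H)b(A_1^{-1})/[1 - nb(H)]$. 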



If we take $x_0 = 0$ in the Seidel iteration, then $x_p = C_p y$, and the 
deviation is $x - x_p = (A^{-1} - C_p)y$. In general, however, for an arbitrary $x_0$, 
if $r_p = y - Ax_p$, then 

$r_p = A_1(x_{p+1} - x_p)$ 
and 

$N(r_p) \le M(A_1)N(x_{p+1} - x_p)$. 
Since 

$x - x_p = (I - H)^{-1}A_1^{-1}r_p$, 
then 

$s_p = x - x_p = (I - H)^{-1}(x_{p+1} - x_p)$. 
Hence 

(2.131.16)  $N(s_p) \le N(x_{p+1} - x_p)/[1 - M(H)]$. 

Also 

(2.131.17)  $N(s_p) \le M(H)N(x_p - x_{p-1})/[1 - M(H)]$ 
            $\le N(H)N(x_p - x_{p-1})/[1 - N(H)]$. 

These inequalities provide estimates of the error in terms of the magnitude 
of a particular correction. 
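
An estimate of this kind can be tried on a small Seidel iteration. The sketch below builds $H = -A_1^{-1}A_2$ column by column, takes N to be the square root of the sum of squares of the elements, and compares the true error after each cycle with the bound $N(H)N(x_p - x_{p-1})/[1 - N(H)]$. The test system and all function names are illustrative assumptions.

```python
import math


def forward_solve(A, b):
    """Solve A1 z = b, where A1 is the lower triangle of A including the diagonal."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(A[i][j] * z[j] for j in range(i))) / A[i][i]
    return z


def seidel_matrix(A):
    """H = -A1^{-1} A2, built one column at a time."""
    n = len(A)
    cols = [forward_solve(A, [-A[i][j] if i < j else 0.0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


A = [[4.0, 1.0], [1.0, 3.0]]
y = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]

H = seidel_matrix(A)
NH = math.sqrt(sum(t * t for row in H for t in row))   # N(H)

x_prev = [0.0, 0.0]
checks = []   # pairs (true error, bound) after each cycle
for _ in range(6):
    # One Seidel cycle: A1 x_new = y - A2 x_prev.
    rhs = [y[i] - sum(A[i][j] * x_prev[j] for j in range(i + 1, len(A)))
           for i in range(len(A))]
    x_new = forward_solve(A, rhs)
    bound = NH * norm([a - b for a, b in zip(x_new, x_prev)]) / (1.0 - NH)
    true_error = norm([a - b for a, b in zip(exact, x_new)])
    checks.append((true_error, bound))
    x_prev = x_new
```

The bound is computed only from quantities available during the iteration, which is what makes it usable when the true solution is unknown.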

2.132. Generated errors. If A is symmetric, let $\mu$ and u be the numerically 
smallest proper value and an associated proper vector, respectively. 
If $x_0$ is any approximation to $x = A^{-1}y$, then $Ax_0 = y - r_0$ while 

$A(x_0 + u) = y - r_0 + \mu u$. 

Hence if $\mu$ is very small, a large component in $x_0$ along u would appear in 
$r_0$ as only a small component in the same direction. Another way of 
saying this is to say that a putative solution $x_0$ might yield a residual $r_0$ 
that would be regarded as negligibly small even when $x_0$ has a large 
erroneous component along u. 

In general, for any matrix if $x_1$ and $x_2$ are two putative solutions, then 

$r_2 - r_1 = A(x_1 - x_2)$, 

and if $m(A)$ is small, then a large difference $x_1 - x_2$ could result in only a 
small $r_2 - r_1$, possibly less than the maximum round-off error. In fact, 
if $\epsilon$ is the limit of the round-off error, then $\epsilon/m(A)$ represents the limit of 
detectable accuracy in the solution. 
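
A small numerical instance may make the point concrete. In the sketch below the symmetric matrix has proper values 1.999 and 0.001; a putative solution carrying an error of unit length along the proper vector of the small proper value leaves a residual of length only about 0.001. The matrix and the names are illustrative assumptions.

```python
import math


def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


# Illustrative symmetric matrix (an assumption) with proper values 1.999 and 0.001.
A = [[1.0, 0.999], [0.999, 1.0]]
x = [1.0, 1.0]                       # taken as the true solution
y = matvec(A, x)

u = [1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0)]   # proper vector for mu = 0.001
x0 = [xi + ui for xi, ui in zip(x, u)]              # putative solution with unit error

r0 = [yi - ti for yi, ti in zip(y, matvec(A, x0))]
error_length = norm([a - b for a, b in zip(x0, x)])
residual_length = norm(r0)
```

Here the residual understates the error by a factor of about one thousand, the reciprocal of the small proper value.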

There is no a priori assurance, however, that any particular method of 
solution will give a result that is even that close. We therefore consider 
this question for some of the iterative methods described above. It will 
be assumed, for definiteness, that the operations are fixed-point with 
maximal round-off $\epsilon$, all numbers being scaled to magnitude less than 
unity, and that in the multiplication of vectors and matrices it is possible 
to accumulate complete products and round off the sum. If each product 
is rounded, then generally in the estimates given below the factor $\epsilon$ must 
be multiplied by n, the order of the matrix. 

In any iterative process which utilizes one approximation to obtain one 
that is theoretically closer, the given approximation actually utilized in 
the computation, however it may have been obtained, is digital. To the 
digital approximation one applies certain pseudo operations to obtain 
another digital approximation. Two partially distinct questions arise: 
Given a digital approximation and a particular method of iteration, can 
we be sure that the next iteration will give improvement? Given two 
digital approximations, however obtained, when can we be sure that one 
is better than the other? These are questions relating to both the 
generated and the residual errors, since for iterative methods they merge 
together. 

Basic to the discussion is the fact that, when a product $Ax_0$, say, of a 
digital matrix by a digital vector, is rounded off by rounding only the 
accumulated sums and not the separate products of the elements, then the 
resulting digital vector, which will be designated $(Ax_0)^*$, satisfies 

$b[Ax_0 - (Ax_0)^*] \le \epsilon$,  $N[Ax_0 - (Ax_0)^*] \le n^{1/2}\epsilon$. 

An additional factor n would appear if each product of elements were 
rounded. Likewise, for the multiplication of two digital matrices the 
digital product satisfies 

$b[AC - (AC)^*] \le \epsilon$,  $N[AC - (AC)^*] \le n\epsilon$. 

2.1321. The Seidel Process. We consider only a positive definite 
matrix A. The process is based upon the fact that the vector x which 
satisfies the equations is the vector x which minimizes the function 

$x^T(Ax - 2y) = -x^T(y + r)$. 

Hence it maximizes $x^T(y + r)$. Hence, given two approximate solutions 
$x_0$ and $x_1$, we shall say that $x_1$ is a better approximation than $x_0$ provided 

(2.1321.1)  $x_1^T(y + r_1) > x_0^T(y + r_0)$, 

and if the two quantities are equal, the two approximations are equally 
good. However for making the test in a particular instance, there will be 
available only the vectors 

$r_p^* = y - (Ax_p)^*$,  $p = 0, 1$. 
By (2.132.2), 

$N(r_p - r_p^*) = N[(Ax_p)^* - Ax_p] \le n^{1/2}\epsilon$, 
and therefore 

$|x_p^T(r_p - r_p^*)| \le N(x_p)N(r_p - r_p^*) \le n^{1/2}\epsilon N(x_p)$. 
Also 

$x_p^T(y + r_p) = x_p^T(y + r_p^*) + x_p^T(r_p - r_p^*)$. 

Hence 

$x_1^T(y + r_1) - x_0^T(y + r_0) = x_1^T(y + r_1^*) - x_0^T(y + r_0^*)$ 
    $+ x_1^T(r_1 - r_1^*) - x_0^T(r_0 - r_0^*)$, 
and (2.1321.1) will certainly be satisfied if 

(2.1321.2)  $x_1^T(y + r_1^*) - x_0^T(y + r_0^*) > n^{1/2}\epsilon[N(x_0) + N(x_1)]$. 
Since we can also say that 

$|x_p^T(r_p - r_p^*)| \le nb(x_p)b(r_p - r_p^*) \le n\epsilon b(x_p)$, 
therefore (2.1321.1) is also implied by 

(2.1321.3)  $x_1^T(y + r_1^*) - x_0^T(y + r_0^*) > n\epsilon[b(x_0) + b(x_1)]$. 

This requirement is somewhat more stringent. 

Now consider a particular approximation $x_0$ and the digital approximation 
that would be obtained from $x_0$ following a single projection. Can 
we be assured that the digital result of making the projection will be a 
better approximation than $x_0$? If the projection is made on $e_i$, we wish 
to know whether 

$(x_0 + \lambda^* e_i)^T\{y + [y - A(x_0 + \lambda^* e_i)]\} > x_0^T(y + r_0)$, 
where 

A == ^7*0 ?" 



We suppose every $\alpha_{ii} = 1$. This does not violate the requirement that all 
stored quantities be in magnitude less than unity since the $\alpha_{ii}$ need not 
be stored explicitly in this case. Hence 

$\lambda^* = e_i^T r_0^*$,  $\lambda = e_i^T r_0$. 
The above inequality reduces to 

$2\lambda^*\lambda > \lambda^{*2}$. 
For $\lambda^* > 0$, this is equivalent to 

$\lambda^* > 2(\lambda^* - \lambda)$, 
and for $\lambda^* < 0$ it is equivalent to 

$\lambda^* < 2(\lambda^* - \lambda)$. 

Since $|\lambda^* - \lambda| \le \epsilon$, either condition is assured by 
(2.1321.4)  $|\lambda^*| = |e_i^T r_0^*| > 2\epsilon$. 

If $|\lambda^*| < 2\epsilon$, then $\lambda^* = e_i^T r_0^* = 0$, and no change is made in $x_0$; if $|\lambda^*| = 2\epsilon$, 
then at least the modified vector is not worse than $x_0$. Hence in spite of 
the round-off, no step in the process can yield a poorer approximation, 
and in general any step will yield a better one until ultimately some 
$r^* = 0$. 

2.1322. Iteration with an Approximate Inverse. Next consider an 
arbitrary nonsingular matrix A with a given approximate inverse C and a 
given approximation $x_0$ to $x = A^{-1}y$. Elements of the inverse will be out 
of range if the elements of A are digital. Hence C must be stored in the 
form $\beta^{-\gamma}C$, where $\gamma$ is some positive integer, and the elements of this 
matrix will be assumed digital. Also $\gamma$ will be supposed large enough so 
that all prescribed operations yield numbers in range. We suppose y 
scaled so that x and any approximation are in range. Hence x is supposed 
digital. 

As a measure of magnitude of a vector r, we use $b(r)$, and associate with 
it a measure of magnitude of a matrix A, denoted $c(A)$, and defined by 

$c(A) = \max_i \sum_j |\alpha_{ij}|$. 

One verifies easily that for any two matrices A and B 

$c(AB) \le c(A)c(B)$, 
$c(A + B) \le c(A) + c(B)$. 

If we form a matrix having any vector r in the first column and zero elsewhere, 
and apply the first of these inequalities, we conclude that 

$b(Ar) \le c(A)b(r)$. 
Moreover 

$c(A) = \max_{r \ne 0} b(Ar)/b(r)$, 

though this property will not be required. 
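
These relations are easy to check numerically. The sketch below assumes that $b(r)$ denotes the largest absolute value among the elements of r and that $c(A)$ is the largest row sum of absolute values, as the inequalities above suggest; the matrices and the vector are arbitrary illustrative choices.

```python
def b(r):
    """Largest absolute value among the elements of a vector."""
    return max(abs(t) for t in r)


def c(A):
    """Largest row sum of absolute values of the elements of a matrix."""
    return max(sum(abs(t) for t in row) for row in A)


def matvec(A, r):
    return [sum(a * t for a, t in zip(row, r)) for row in A]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matadd(X, Y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(X, Y)]


A = [[1.0, -2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.0]]
r = [1.0, 1.0]

lhs_vector = b(matvec(A, r))      # b(Ar)
rhs_vector = c(A) * b(r)          # c(A) b(r)
```

For this r the two sides agree, which illustrates that the maximum in the last relation is attained.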

If $x_0$ and $x_1$ are any two digital approximations to x, we ask first under 
what conditions we can be assured that $b(r_1) \le b(r_0)$. Since we calculate 

$r_p^* = y - (Ax_p)^*$,  $p = 0, 1$, 
we can be assured of the relation only when 

$b(r_1^*) + b(r_1 - r_1^*) \le b(r_0^*) - b(r_0 - r_0^*)$. 

But each element of $r^*$ can be in error by an amount $\epsilon$ (if individual 
products are rounded, it is $n\epsilon$), whence the condition is 

$b(r_1^*) \le b(r_0^*) - 2\epsilon$. 

When the equality holds, then at worst $b(r_1) = b(r_0)$. 

Now suppose we have the approximation $x_0$ and wish to decide whether 
to attempt to improve the approximation by forming $x_0 + Cr_0$. Are we 
assured of obtaining a better approximation? Actually we form a 
digital vector 

$x_1 = x_0 + \beta^\gamma(\beta^{-\gamma}Cr_0^*)^*$, 

and the question is whether this is a better approximation than $x_0$. We 
have identically 

(2.1322.1)  $r_1 = r_0 - r_0^* + B^*r_0^* + (B - B^*)r_0^*$ 
            $+ \beta^\gamma A[\beta^{-\gamma}Cr_0^* - (\beta^{-\gamma}Cr_0^*)^*]$, 

where 

$B = I - AC$,  $B^* = I - \beta^\gamma[A(\beta^{-\gamma}C)]^*$. 

Hence an improvement is assured if $b(r_0)$ exceeds the b function of the 
right member of (2.1322.1), and this is certainly true when $b(r_0^*) - b(r_0 - r_0^*)$ 
exceeds the same quantity. Hence the condition is 

(2.1322.2)  $b(r_0^*) > \{2b(r_0 - r_0^*) + \beta^\gamma c(A)b[\beta^{-\gamma}Cr_0^* - (\beta^{-\gamma}Cr_0^*)^*]\}/[1 - c(B^*) - c(B - B^*)]$. 

In this relation $c(A)$, $c(B^*)$, and $b(r_0^*)$ can be evaluated directly while the 
other quantities are limited by the computational routine. 

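By the contemplated routine of rounding off the accumulation of 
products, each element of $(\beta^{-\gamma}Cr_0^*)^*$ can differ from $\beta^{-\gamma}Cr_0^*$ by as much 
as $\epsilon$, whence 

$b[\beta^{-\gamma}Cr_0^* - (\beta^{-\gamma}Cr_0^*)^*] \le \epsilon$. 

As for $B^*$, each element in $[A(\beta^{-\gamma}C)]^*$ can be in error by $\epsilon$, and n terms 
contribute to the c function. Hence 

$c(B - B^*) \le n\epsilon\beta^\gamma$. 
The condition required is therefore 
(2.1322.3)  $b(r_0^*) > \epsilon[2 + \beta^\gamma c(A)]/[1 - n\epsilon\beta^\gamma - c(B^*)]$. 

The dominant term in the numerator is probably $\beta^\gamma c(A)$. 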

A slight modification of the routine can improve the situation by 
"damping out" the term $\beta^\gamma c(A)$. Since $r_0^*$ is presumably small, it should 
be possible to scale it up by a factor $\beta^\delta$, forming $(\beta^{\delta-\gamma}Cr_0^*)^*$. If this is 
done, one has 

$b[\beta^{-\gamma}Cr_0^* - \beta^{-\delta}(\beta^{\delta-\gamma}Cr_0^*)^*] \le \beta^{-\delta}\epsilon$, 

and in place of (2.1322.3), 

(2.1322.4)  $b(r_0^*) > \epsilon[2 + \beta^{\gamma-\delta}c(A)]/[1 - n\epsilon\beta^\gamma - c(B^*)]$. 

On consecutive iterations as the residual diminishes, $\delta$ can be increased, 
possibly even until the term $\beta^{\gamma-\delta}c(A)$ becomes negligible. In the denominator, 
$c(B^*)$ can be reduced, if desired, by iterating to improve the 
approximate inverse. But the term $n\epsilon\beta^\gamma$ is not at our disposal. 

By a further modification of the routine, the factor $\beta^{-\delta}$ can be brought 
before the entire numerator. If the products required in forming $Ax_0$ 
can be accumulated before rounding, the accumulation can also be subtracted 
from y and the result multiplied by $\beta^\delta$ before rounding. This 
gives 

$b(r_0 - r_0^*) \le \beta^{-\delta}\epsilon$. 

If this is done, the condition 

(2.1322.5)  $b(r_0^*) > \epsilon\beta^{-\delta}[2 + \beta^\gamma c(A)]/[1 - n\epsilon\beta^\gamma - c(B^*)]$ 

is sufficient to ensure improvement. Since the stored quantities are 
elements of $\beta^\delta r^*$, the elements of $r^*$ can be made actually less than the 
maximal round-off. 

2.2. Direct Methods. A direct method for solving an equation or 
system of equations is any finite sequence of arithmetic operations that 
will result in an exact solution. Since the operations one actually per- 
forms are generally pseudo operations, the direct methods do not gen- 
erally in practice yield exact results. Nevertheless, the results may be as 
accurate as one requires, or it may be advantageous to apply first a 
direct method after which the solution may be improved by the applica- 
tion of one of the iterative methods. 

Certainly all (correct) direct methods are equivalent in the sense that 
they all yield in principle the same exact solution (when the matrix is 
nonsingular and the solution unique). Nevertheless the methods differ 
in the total number of operations (additions and subtractions, multipli- 
cations, divisions, recordings) that they require and in the order in which 
these take place. As a consequence they differ also as to the opportuni- 
ties for making blunders and as to the magnitude of the generated error. 

Most direct methods involve obtaining, at one stage or another, a 
system of equations of such a type that one equation contains only one 
unknown, a second equation contains only this unknown and one other, a 
third only these and one other, etc. The procedure for solving such a 
system is quite obvious. The matrix of such a system is said to be 
triangular (or semidiagonal), since either every element below the prin- 
cipal diagonal, or else every element above the principal diagonal, has 
the value zero. A matrix of the first of these types is upper triangular; 
one of the second is lower triangular. If it happens, in addition, that 
every diagonal element is equal to 1, then the matrix is unit upper tri- 
angular or unit lower triangular, as the case may be. We shall say that 
a matrix M is of triangular type, upper or lower, if it can be partitioned 
into one of the two forms 



(2.2.1)  $M = \begin{pmatrix} M_{11} & M_{12} \\ 0 & M_{22} \end{pmatrix}$,  $M = \begin{pmatrix} M_{11} & 0 \\ M_{21} & M_{22} \end{pmatrix}$, 

where $M_{11}$ and $M_{22}$ are both square matrices, as is M itself. We shall 
consider such matrices briefly. 

2.201. Matrices of triangular type. If M is of upper triangular type, 
then $M^T$ is of lower triangular type. Hence it is sufficient to consider 
just one of these types. One verifies directly that, if M and N are both 
of upper triangular type when similarly partitioned (i.e., corresponding 
submatrices have the same dimensions), then the product MN is of upper 
triangular type when similarly partitioned. If, further, $M_{11}$ and $M_{22}$ are 
nonsingular, then $M^{-1}$ exists, and is of upper triangular type when similarly 
partitioned. In fact, 

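(2.201.1)  $M^{-1} = \begin{pmatrix} M_{11}^{-1} & -M_{11}^{-1}M_{12}M_{22}^{-1} \\ 0 & M_{22}^{-1} \end{pmatrix}$. 

Hence if $M_{11}$ is upper triangular and $M_{22}$ a scalar, then M itself is upper 
triangular and not merely of upper triangular type. In this case (2.201.1) 
provides a stepwise procedure for inverting a triangular matrix, if $M_{22}$ 
is a scalar while $M_{11}$ is a matrix which has been inverted. 
If a matrix A is partitioned as 

(2.201.2)  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ 

with $A_{11}$ nonsingular, then A can be expressed in many ways as the 
product of matrices of triangular type in the form 

(2.201.3)  $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} N_{11} & 0 \\ N_{21} & N_{22} \end{pmatrix} \begin{pmatrix} M_{11} & M_{12} \\ 0 & M_{22} \end{pmatrix}$. 

In fact, for an arbitrary nonsingular $N_{11}$ (of the dimensions of $A_{11}$), 
$M_{11}$, $M_{12}$, and $N_{21}$ are uniquely determined independently of $A_{22}$ and of 
the selection of $N_{22}$ and $M_{22}$. For one verifies directly that 

(2.201.4)  $M_{11} = N_{11}^{-1}A_{11}$,  $M_{12} = N_{11}^{-1}A_{12}$,  $N_{21} = A_{21}M_{11}^{-1}$, 

while $N_{22}$ and $M_{22}$ are restricted only by the relation 

$N_{22}M_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$. 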

This being the case, we can give an inductive algorithm for a factorization 
of A into the product of a unit lower triangular matrix L and an 
upper triangular matrix W. That such a factorization exists and is 
unique when A is of second order and $A_{11} \ne 0$ follows from the above by 
taking $N_{11} = N_{22} = 1$. For purposes of the induction suppose that $N_{11}$ 
above was unit lower triangular and $M_{11}$ upper triangular. Then $M_{12}$ 
and $N_{21}$ are uniquely determined by (2.201.4). We change the notation 
and partition further, writing 

(2.201.5)  $\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ 0 & W_{22} & W_{23} \\ 0 & 0 & W_{33} \end{pmatrix}$, 

where $L_{11} = N_{11}$, $W_{11} = M_{11}$, $A_{11}$ is the same as above, but the submatrices 
previously designated $A_{22}$, $A_{21}$, and $A_{12}$ are now further partitioned, 
as are the matrices $N_{21}$ and $M_{12}$. When the necessary inverses 
exist, these last matrices or their own submatrices are determined 
uniquely by Eqs. (2.201.4) which now have the form 

$W_{12} = L_{11}^{-1}A_{12}$,  $W_{13} = L_{11}^{-1}A_{13}$,  $L_{21} = A_{21}W_{11}^{-1}$,  $L_{31} = A_{31}W_{11}^{-1}$. 
Four conditions remain to be satisfied. Of these, three give 

(2.201.6)  $W_{22} = L_{22}^{-1}(A_{22} - L_{21}W_{12})$, 
           $W_{23} = L_{22}^{-1}(A_{23} - L_{21}W_{13})$, 
           $L_{32} = (A_{32} - L_{31}W_{12})W_{22}^{-1}$. 

Hence $W_{22}$, $W_{23}$, and $L_{32}$ can be determined uniquely from A and from the 
portions of L and W already determined, provided only that $A_{22} - L_{21}W_{12}$ 
is nonsingular, and independently of the choice of the matrices $L_{33}$ and 
$W_{33}$. The last condition merely specifies the product $L_{33}W_{33}$. Hence 
for the inductive algorithm take $L_{22} = 1$ and determine the scalar $W_{22}$ 
and the vectors $W_{23}$ and $L_{32}$. Now the matrices 

$\begin{pmatrix} L_{11} & 0 \\ L_{21} & 1 \end{pmatrix}$,  $\begin{pmatrix} W_{11} & W_{12} \\ 0 & W_{22} \end{pmatrix}$ 

are unit lower and upper triangular matrices, respectively, to be designated 
$L_{11}$ and $W_{11}$ for the next steps. The process fails if at some stage 
$W_{22} = 0$. If A is nonsingular, it is always possible to rearrange rows and 
columns and continue. 
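
The inductive algorithm can be sketched in a few lines. In the version below, stage k determines the scalar $W_{22}$ and the row vector $W_{23}$ (row k of W) and then the column vector $L_{32}$ (column k of L), using only the portions of L and W already found. No rearrangement of rows or columns is attempted; the function name and the test matrix are illustrative assumptions.

```python
def factor_lw(A):
    """Return (L, W) with A = L W, L unit lower triangular and W upper triangular."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Row k of W: the scalar W22 (j == k) and the row vector W23 (j > k).
        for j in range(k, n):
            W[k][j] = A[k][j] - sum(L[k][s] * W[s][j] for s in range(k))
        if W[k][k] == 0.0:
            raise ZeroDivisionError("W22 vanished; rows or columns must be rearranged")
        # Column k of L below the diagonal: the column vector L32.
        for i in range(k + 1, n):
            L[i][k] = (A[i][k] - sum(L[i][s] * W[s][k] for s in range(k))) / W[k][k]
    return L, W


# Illustrative matrix (an assumption) whose factors have small integer elements.
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, W = factor_lw(A)
```

Multiplying the two factors back together reproduces A, which is a convenient check on any hand or machine computation of this kind.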

By applying the process to $A^T$, we note that we could equally well make 
W a unit upper triangular matrix and L lower triangular, and again the 
factorization is unique. 

When A is symmetric and L is unit lower triangular, let $D^2$ represent 
the diagonal matrix of elements of the principal diagonal of W. Then 

$D^{-2}W = L^T$, 
and the factorization can be written 

$A = LD^2L^T$. 
If we write 

$LD = K$, 
then 

$A = KK^T$. 

If A is not positive definite, then $D^2$ will have negative elements, and 
hence D will have pure imaginary elements. However this presents no 
real computational difficulty. 

We conclude this section by noting that, if B is an arbitrary matrix 
of $m \le n$ linearly independent columns, there exists a unique unit upper 
triangular matrix V of order m such that the columns of 

(2.201.7)  $R = BV^{-1}$ 

are mutually orthogonal with respect to the matrix G. This is to say 
that the matrix 

(2.201.8)  $R^TGR = D$ 

is a diagonal matrix. To make the proof inductive, and exhibit the 
algorithm, suppose this has been accomplished for the matrix B with 
$m < n$, and suppose that the vector b is independent of the columns of 
B. We wish then to select vectors r and v so that 

(2.201.9)  $b = Rv + r$ 
and 

$R^TGr = 0$. 

These conditions are satisfied by taking 
(2.201.10)  $v = D^{-1}R^TGb$,  $r = b - Rv$. 

Geometrically, this process amounts to resolving b into a component in 
the space of the columns of B (or of R) and a component orthogonal to 
this space, the latter component becoming the vector r. 
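
The step (2.201.10) can be applied column by column. The sketch below does so for a small example, returning the G-orthogonal columns, the diagonal of D, and the unit upper triangular matrix V; G is assumed symmetric and positive definite, and the data and names are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def g_orthogonalize(columns, G):
    """Return (R, d, V): G-orthogonal columns R, diagonal d of D, unit upper triangular V."""
    m = len(columns)
    R, d = [], []
    V = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j, bcol in enumerate(columns):
        Gb = matvec(G, bcol)
        v = [dot(R[k], Gb) / d[k] for k in range(j)]           # v = D^{-1} R^T G b
        r = [bcol[i] - sum(v[k] * R[k][i] for k in range(j))   # r = b - R v
             for i in range(len(bcol))]
        for k in range(j):
            V[k][j] = v[k]
        R.append(r)
        d.append(dot(r, matvec(G, r)))
    return R, d, V


# Illustrative symmetric positive definite G and independent columns (assumptions).
G = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
B_columns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
R, d, V = g_orthogonalize(B_columns, G)
```

When G is the identity this reduces to the familiar orthogonalization of the columns of B without normalization.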

Most of the classical direct methods for solving systems of linear 
equations can now be deduced almost immediately. 

2.21. Methods of Elimination. In elementary algebra one learns to 
solve a system of equations by "the method of elimination by addition 
and subtraction." In this method an equation is selected in which the 
coefficient of the first unknown $\xi_1$ is non-null, and one adds an appropriate 
multiple of this equation to each of the others in turn so that $\xi_1$ 
is eliminated from these equations. The resulting $n - 1$ equations, 
together with the one used for the elimination, constitute a new system 
equivalent to the first system, and the $n - 1$ equations contain only the 
$n - 1$ unknowns $\xi_2, \ldots, \xi_n$. The same process applied to these 
yields $n - 2$ equations containing only the $n - 2$ unknowns $\xi_3, \ldots, \xi_n$. 
Eventually one obtains a single equation in $\xi_n$ alone. The final solution 
is now obtained by "back substitution," in which the value of $\xi_n$ obtained 
from the last equation is substituted into the preceding which can then 
be solved for $\xi_{n-1}$, these are then substituted into the one before, etc. 
Thus the elimination phase followed by the back-substitution phase 
yields the final solution. 
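
The two phases can be sketched as follows. The routine selects, at each stage, the first remaining equation whose leading coefficient is non-null, eliminates that unknown from the later equations, and finishes with back substitution. It is only an illustration: the function name and the test system are assumptions, and no attempt is made to choose a large pivot.

```python
def solve_by_elimination(A, y):
    """Solve A x = y by elimination followed by back substitution."""
    n = len(A)
    # Work on an augmented copy so the caller's data are left untouched.
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for k in range(n):
        # Select an equation in which the coefficient of unknown k is non-null.
        pivot = next((i for i in range(k, n) if M[i][k] != 0.0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Subtract a multiple of equation k from each later one to eliminate unknown k.
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back substitution, starting from the single equation in the last unknown.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Illustrative system (an assumption); its first equation lacks the first unknown.
A = [[0.0, 2.0, 1.0], [1.0, 1.0, 1.0], [2.0, 1.0, 3.0]]
y = [5.0, 4.0, 7.0]
x = solve_by_elimination(A, y)
```

The example was chosen so that the first equation cannot be used for the first elimination, which exercises the selection step.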

In the elimination phase, the operation of eliminating each of the 
unknowns is equivalent to the operation of multiplying the system by a 
particular unit lower triangular matrix, a matrix, in fact, whose off-diagonal 
non-null elements are all in the same column. The product of 
all these unit lower triangular matrices is again a unit lower triangular 
matrix, and hence the entire process of elimination (as opposed to that 
of back substitution) is equivalent to that of multiplying the system by a 
suitably chosen unit lower triangular matrix. Since the matrix of the 
resulting system is clearly upper triangular, these considerations constitute 
another proof of the possibility of factorizing A into a unit lower 
triangular matrix and an upper triangular matrix. 
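For the system

$Ax = y,$

after eliminating any one of the variables, the effect to that point is that
of having selected a unit lower triangular matrix of the form

$\begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{22} \end{pmatrix},$

where $L_{11}$ is itself unit lower triangular, in such a way that $A$ is factored

(2.21.1) $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{22} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} \\ 0 & M_{22} \end{pmatrix}$

with $W_{11}$ upper triangular but $M_{22}$ not. Hence

(2.21.2) $M_{22} = A_{22} - A_{21} A_{11}^{-1} A_{12}.$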

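The original system has at this stage been replaced by the system

(2.21.3) $\begin{pmatrix} W_{11} & W_{12} \\ 0 & M_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix},$

where

(2.21.4) $\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{22} \end{pmatrix}^{-1} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}.$

The matrices $L_{11}$ and $L_{21}$ are not themselves written down. The partial
system

$M_{22} x_2 = z_2$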



represents those equations from which further elimination remains to be 
done, and this can be treated independently of the other equations of the 
system, which fact explains why it is unnecessary to obtain the L matrices 
explicitly. 

If the upper left-hand element of $M_{22}$ vanishes, this cannot be used
in the next step of the elimination, and it is not advantageous to use it
when it is small. Hence rows or columns, or both, in $M_{22}$ must be
rearranged to bring to this position an element that is sufficiently large.
Corresponding changes must be made in the notation.

Crout's method differs in that the $L$ matrices are written down explicitly
at each stage while $M_{22}$ is not. It utilizes the inductive algorithm
given in (2.201.6) where, as we have seen, each successive column of $L$
and row of $W$ can be obtained from those previously computed together
with the corresponding row and column of $A$. In order to compute the
vector $z$ as one goes along, one writes the augmented matrix $(A, y)$.
Then the partitioning of (2.201.5) is extended to the following:

(2.21.5) $\begin{pmatrix} A_{11} & A_{12} & A_{13} & y_1 \\ A_{21} & A_{22} & A_{23} & y_2 \\ A_{31} & A_{32} & A_{33} & y_3 \end{pmatrix} = \begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} & W_{13} & z_1 \\ 0 & W_{22} & W_{23} & z_2 \\ 0 & 0 & W_{33} & z_3 \end{pmatrix},$

and supposing $L_{11}$, $L_{21}$, $L_{31}$, $W_{11}$, $W_{12}$, $W_{13}$, $z_1$ already determined, $L_{22}$
is prescribed (in practice $L_{22} = 1$), and $L_{32}$, $W_{22}$, $W_{23}$, $z_2$ are to be
determined at this step. Equations (2.201.6) give $L_{32}$, $W_{22}$, and $W_{23}$, while $z_2$
is given by

(2.21.6) $z_2 = L_{22}^{-1}(y_2 - L_{21} z_1).$

While in practice one takes $L_{22} = 1$, this equation and Eqs. (2.201.6)
are perfectly general. Since neither $L_{33}$, $W_{33}$, nor $z_3$ occurs in any of
these relations, one can, with Crout, write the two matrices $L - I$ and
$(W, z)$ in the same rectangular array, filling out in sequence the first
row, the first column, the second row, the second column, etc. When
this array is filled out, the elements along and to the right of the principal
diagonal are the coefficients and the constants in the triangular equations
$Wx = z$.

In case one has two or more sets of equations with the same matrix $A$,
then the vectors $y$ and $z$ may be replaced by the matrices $Y$ and $Z$ in
(2.21.5) and (2.21.6). Alternatively one may solve one of these systems,
after which, with $L$ and $W$ already known, the elements of any other
column $z$ in $Z$ are obtained sequentially from the corresponding column $y$
in $Y$ by using (2.21.6), remembering that at the start there is no partial
vector $z_1$ so that one has simply $\zeta_1 = \eta_1$. In particular, if a single system
is solved by this method, and a result $x_0$ is obtained which is only
approximate because of round-off errors, we have seen that the error vector
$x - x_0$ satisfies a system with the same matrix $A$, so that (2.21.6) can be
applied with $y - Ax_0$ replacing $y$.
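
The filling of a single rectangular array, row then column, can be sketched as follows. This is an illustrative sketch only: the names `crout_compact` and `back_substitute` are hypothetical, $L$ is taken to be unit lower triangular, and the natural order is used without any positioning for size.

```python
def crout_compact(a, y):
    """Overwrite a copy of (A, y) with the compact array of the factorization.

    On return, the elements on and to the right of the diagonal hold (W, z);
    the elements below the diagonal hold L - I (L is unit lower triangular).
    Assumption: no diagonal element of W vanishes.
    """
    n = len(a)
    t = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, y)]
    for k in range(n):
        # Row k of (W, z): subtract products of row k of L with columns of (W, z).
        for j in range(k, n + 1):
            t[k][j] -= sum(t[k][p] * t[p][j] for p in range(k))
        if t[k][k] == 0.0:
            raise ZeroDivisionError("null diagonal element in W")
        # Column k of L below the diagonal.
        for i in range(k + 1, n):
            t[i][k] = (t[i][k] - sum(t[i][p] * t[p][k] for p in range(k))) / t[k][k]
    return t


def back_substitute(t):
    """Solve the triangular system W x = z stored in the compact array t."""
    n = len(t)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (t[i][n] - sum(t[i][j] * x[j] for j in range(i + 1, n))) / t[i][i]
    return x


if __name__ == "__main__":
    array = crout_compact([[2, 1, 1], [4, 3, 3], [8, 7, 9]], [4, 10, 24])
    print(array)
    print(back_substitute(array))
```

Each inner `sum` is an accumulated product of the kind the text describes, so only the finished element is recorded in the array.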

Another modification of the method of elimination is that of Jordan.
It is clear that, after $\xi_1$ has been eliminated from equations 2, ..., n,
and while the new second equation is being used to eliminate $\xi_2$ from
equations 3, ..., n, this can also be used to eliminate $\xi_2$ from the first
equation. Next the third equation can be used to eliminate $\xi_3$ from what
are now equations 1 and 2, as well as from equations 4, ..., n. By
proceeding thus, one obtains an equivalent system of the form $Dx = w$,
where $D$ is diagonal. This amounts to multiplying the original system
$Ax = y$ sequentially by matrices each of which differs from the identity
only in a single column. However, this column will have non-null
elements both above and below the diagonal.

Crout's method provides a routine for triangular factorization which
minimizes the number of recordings and also the space required for the
recordings. This is very desirable, whether the computations are by
automatic machinery or not. For machine computation it has the
disadvantage of requiring products such as $L_{31}W_{12}$ involving elements from a
column of $L$ and from a row of $W$. Jordan's method permits a similar
economy of recording without requiring operations upon columns.

To see this we note first that, if $J$ is a matrix satisfying

$J(A, I) = (I, J),$

then certainly $J = A^{-1}$. In words, if it is possible to find a sequence of
row operations which, when performed upon the matrix $(A, I)$, reduces
it to a matrix $(I, J)$, then $J$ is the required inverse of $A$. Jordan's method
forms $(I, J)$ stepwise from $(A, I)$ by multiplying on the left by matrices
of the form $I + J_i$, where $J_i$ differs from the null matrix in the $i$th column
only. The process will be complicated somewhat in "positioning for
size," but we neglect this here and assume that the process can be carried
out in natural order from the first column to the last. Then the sequence
starts with the operation

$(I + J_1)(A, I) = (A_1, K_1),$

and continues with

$(I + J_2)(A_1, K_1) = (A_2, K_2),$

and generally

$(I + J_i)(A_{i-1}, K_{i-1}) = (A_i, K_i).$

We now observe that in $A_i$ the first $i$ columns are the same as those of $I$,
and in $K_i$ the last $n - i$ columns are the same as those of $I$. Thus one
needs to record only the first $i$ columns of $K_i$ and the last $n - i$ columns
of $A_i$. The $i$th column of $J_i$ is to be selected so that the $i$th column
of the product $(I + J_i)A_{i-1}$ is equal to $e_i$, and need not be recorded.
Instead one records the $i$th column of $I + J_i$, which becomes the $i$th
column of $K_i$. If the $i$th column of $A_{i-1}$ is $\alpha_1, \ldots, \alpha_n$, and the $i$th
column of $I + J_i$ is $\phi_1, \ldots, \phi_n$, then $\phi_i = \alpha_i^{-1}$, and $\phi_j = -\alpha_j/\alpha_i$ for $j \ne i$.
Hence in forming the composite matrix of nontrivial columns of $A_i$
and of $K_i$, one first forms the $i$th row from the $i$th row of the previous
composite. For this one divides every element but the $i$th (which is
$\alpha_i$) by $\alpha_i$, recording the quotient, and in the $i$th place records $\alpha_i^{-1}$. To
obtain the $j$th row ($j \ne i$), one increases each element except the $i$th by
$-\alpha_j$ times the corresponding element in the new $i$th row. In the $i$th
place one records $\phi_j = -\alpha_j/\alpha_i$.

Clearly if one operates in this fashion upon the matrix $(A, I, y)$, then
one comes out with $(I, A^{-1}, x)$. Thus in using automatic machinery if
$n(n + 1)$ places are reserved in the memory for $(A^{-1}, x)$, then these places
are to be filled first by the matrix $(A, y)$ arranged by rows. Each
multiplication by an $I + J_i$ requires first an operation upon the elements of
the $i$th row, followed by an operation upon the elements of the old $j$th
and the new $i$th row.
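
The economy of recording can be seen in a short sketch in which the $n(n + 1)$ places first hold $(A, y)$ and finally hold $(A^{-1}, x)$. The function name `jordan_in_place` is hypothetical, and, as in the text, positioning for size is neglected, so every diagonal element met is assumed non-null.

```python
def jordan_in_place(a, y):
    """Return (A^{-1}, x) using only an n by (n + 1) working array.

    Assumption: the natural order of the columns can be used throughout.
    """
    n = len(a)
    c = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, y)]
    for i in range(n):
        alpha_i = c[i][i]
        if alpha_i == 0.0:
            raise ZeroDivisionError("null diagonal element; positioning for size needed")
        # New i-th row: divide every element but the i-th by alpha_i,
        # and record 1/alpha_i in the i-th place.
        for k in range(n + 1):
            if k != i:
                c[i][k] /= alpha_i
        c[i][i] = 1.0 / alpha_i
        # Every other row j: add -alpha_j times the new i-th row,
        # and record -alpha_j/alpha_i in the i-th place.
        for j in range(n):
            if j == i:
                continue
            alpha_j = c[j][i]
            for k in range(n + 1):
                if k != i:
                    c[j][k] -= alpha_j * c[i][k]
            c[j][i] = -alpha_j * c[i][i]
    inverse = [row[:n] for row in c]
    x = [row[n] for row in c]
    return inverse, x


if __name__ == "__main__":
    inv, sol = jordan_in_place([[2, 1], [1, 3]], [3, 5])
    print(inv, sol)
```

At step $i$ the array holds the first $i$ columns of $K_i$ in its first $i$ columns and the remaining columns of $A_i$ (and the transformed right member) elsewhere, so nothing outside the original array is ever recorded.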

2.22. Methods of Orthogonalization. Let

(2.22.1) $A = RV, \quad R^{\mathsf T} R = D^2,$

where $V$ is unit upper triangular and $D^2$ is diagonal. We have seen in
§2.201 that such matrices exist. The general metric $G$ of §2.201 is here
taken to be $I$. The matrices $V$ and $R$ can be computed sequentially by
applying Eqs. (2.201.9) and (2.201.10) with appropriate modification of
notation. Then the equations $Ax = y$ can be written $RVx = y$ so that
$D^2 Vx = R^{\mathsf T} y$ and

(2.22.2) $x = V^{-1} D^{-2} R^{\mathsf T} y.$

Since $D$ is diagonal and $V$ unit upper triangular, their inversion is
straightforward. This is Schmidt's method.

In the least-squares problem one has a matrix $B$, with $m < n$ columns,
and a vector $y$, and one seeks a vector $x$ of $m$ elements such that

$Bx = y + d, \quad d^{\mathsf T} B = 0.$

Geometrically the vector $y$ is to be projected orthogonally upon the space
of the columns of $B$, and the components $x$ of the projection are required.
Since

$B^{\mathsf T} Bx = B^{\mathsf T} y,$

these equations yield the required $x$. If, however,

$B = RV, \quad R^{\mathsf T} R = D^2,$

then

$d^{\mathsf T} R = 0,$

whence

$D^2 Vx = R^{\mathsf T} y$

and

$x = V^{-1} D^{-2} R^{\mathsf T} y.$

The orthogonalization process can therefore be applied directly to the
matrix $B$, and $B^{\mathsf T} B$ is not required explicitly.
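
A minimal sketch of the orthogonalization, covering both the square system (Schmidt's method) and the least-squares problem, might run as follows. The names `dot` and `orthogonalize_and_solve` are hypothetical, the metric is taken to be $I$, and the columns of $B$ are assumed linearly independent.

```python
def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def orthogonalize_and_solve(b, y):
    """Return x = V^{-1} D^{-2} R^T y, where B = R V and R^T R = D^2.

    b is given by rows; it may be square or have more rows than columns.
    """
    n_rows, m_cols = len(b), len(b[0])
    columns = [[float(b[i][j]) for i in range(n_rows)] for j in range(m_cols)]
    r_cols = []   # mutually orthogonal columns of R
    d2 = []       # diagonal elements of D^2
    v = [[0.0] * m_cols for _ in range(m_cols)]   # unit upper triangular V
    for j, col in enumerate(columns):
        r = col[:]
        for k in range(j):
            # Element of V: component of the new column along r_k.
            v[k][j] = dot(r_cols[k], col) / d2[k]
            r = [ri - v[k][j] * rk for ri, rk in zip(r, r_cols[k])]
        v[j][j] = 1.0
        norm2 = dot(r, r)
        if norm2 == 0.0:
            raise ValueError("columns of B are linearly dependent")
        r_cols.append(r)
        d2.append(norm2)
    # w = D^{-2} R^T y, then solve the unit triangular system V x = w.
    w = [dot(r_cols[k], y) / d2[k] for k in range(m_cols)]
    x = [0.0] * m_cols
    for i in range(m_cols - 1, -1, -1):
        x[i] = w[i] - sum(v[i][j] * x[j] for j in range(i + 1, m_cols))
    return x


if __name__ == "__main__":
    # Fit a straight line to three points by least squares.
    print(orthogonalize_and_solve([[1, 0], [1, 1], [1, 2]], [1, 2, 2]))
```

Note that the product $B^{\mathsf T} B$ is never formed; only scalar products of columns are required.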
To return to the system $Ax = y$, we may equally well write

(2.22.3) $A = US, \quad SS^{\mathsf T} = D^2,$

where $U$ is unit lower triangular. Let $w$ satisfy

$x = S^{\mathsf T} w.$

Then the equations can be written

$USS^{\mathsf T} w = UD^2 w = y$

so that

$w = D^{-2} U^{-1} y$

and therefore

(2.22.4) $x = S^{\mathsf T} D^{-2} U^{-1} y.$

In this method the rows of $S$ are orthogonal combinations of the rows
of $A$, and since $x^{\mathsf T} = w^{\mathsf T} S$, the vector $x^{\mathsf T}$ is expressed as a linear
combination of these orthogonal row vectors.

If $A$ is positive definite, we could attempt to use $A$, or possibly $A^{-1}$, as
the metric to obtain an orthogonalized set of vectors along which $x$
might be resolved easily, and in fact from §2.201 we can form matrices
$R$ and $V$ such that

$I = RV, \quad R^{\mathsf T} AR = D^2,$

so that if

$x = Rw,$

then

$w = D^{-2} R^{\mathsf T} y,$

$x = RD^{-2} R^{\mathsf T} y.$

Indeed, $R = V^{-1}$, and it is therefore unit upper triangular. Hence the
relation $R^{\mathsf T} AR = D^2$ is equivalent to $A = V^{\mathsf T} D^2 V$, which is the triangular
resolution already obtained but arrived at in a different way.

An orthogonalization process of somewhat different type has been 
devised by Stiefel and Hestenes, independently. The process leads to a 
fairly simple iteration, which, however, terminates in n steps to yield 
the exact solution apart from round-off. Since the n steps yield progres- 
sively better approximations to the true solution, the process can be 
continued beyond n steps for reduction of the round-off error. 

The first step, as applied to a positive definite matrix $A$, is the same
as in the method of steepest descent, in that one starts with an arbitrary
initial approximation $x_0$ and improves it by adding a multiple of the
residual $r_0$. Thereafter, however, instead of adding to each $x_i$ a multiple
of $r_i$, one adds a vector so chosen that $r_{i+1}$ is orthogonal, with respect to
the metric $I$, to all preceding $r_j$. If this can be accomplished, then for
some $m \le n$, $r_m = 0$, and hence $Ax_m = y$. For if all the vectors $r_0$, $r_1$,
..., $r_{n-1}$ are non-null, then being mutually orthogonal they are linearly
independent, and only the null vector is orthogonal to all of them.

Geometrically the method has other points of interest. We have
already noted that the solution $x$ of the equations $Ax = y$ minimizes the
function

(2.22.5) $2f(x) = x^{\mathsf T} Ax - 2x^{\mathsf T} y.$

In fact it represents the common center of the hyperdimensional ellipsoids

(2.22.6) $f(x) = \text{const}.$

This fact provides the usual approach to the method of steepest descent.
Also at $x_0$ the function $f(x)$ is varying most rapidly in the direction of
$r_0$, which is, apart from sign, the gradient at $x_0$ of the function $f(x)$.
Hence one takes

$x_1 = x_0 + \alpha_0 r_0,$

where $\alpha_0$ minimizes the function $f(x_0 + \alpha r_0)$ of the single variable $\alpha$.
It can be shown that the point $x_1$ is the mid-point of the chord through
$x_0$ in the direction $r_0$ of that particular ellipsoid $f(x) = f(x_0)$ which passes
through $x_0$. It is easy to write the equation of the diametral plane which
bisects all chords in the direction $r_0$. This is a diametral plane of the
ellipsoid $f(x) = f(x_1)$ through $x_1$, as well as a diametral plane of

$f(x) = f(x_0),$

and it intersects the original ellipsoids in hyperdimensional ellipsoids
whose dimensionality is one less than that of the original ellipsoids.
Stiefel and Hestenes now improve the approximation $x_1$ by adjoining
a vector in the direction of the gradient to the lower dimensional ellipsoid,
which is the orthogonal projection upon the diametral plane of the
gradient $r_1$ to the $n$-dimensional ellipsoid. One proceeds then to get
ellipsoids of progressively lower dimensionality until one finally reaches
the center itself.

The success of the method depends upon a theorem (Lanczos) which
will have application also in other connections. Beginning with a
vector $r_0$, suppose one seeks to orthogonalize the vectors $r_0$, $Ar_0$, $A^2 r_0$,
... by selecting a set of vectors $r_0$, $r_1$, $r_2$, ... in such a way that $r_{i+1}$ is
a linear combination of the vectors $r_0$, $Ar_0$, ..., $A^{i+1} r_0$, and $r_{i+1}$ is
orthogonal to $r_0$, $r_1$, ..., $r_i$. At most $n$ such vectors will be non-null, and
we have already shown how any set of linearly independent vectors can
be orthogonalized. We can then express $r_{i+1}$ as a linear combination of
the vectors $r_0$, $r_1$, ..., $r_i$ and of $Ar_i$:

$r_{i+1} = \rho_{i+1,0} r_0 + \rho_{i+1,1} r_1 + \cdots + \rho_{i+1,i} r_i + \rho_i Ar_i.$

For $i = 0$ the statement is trivial, for it merely says that $r_1$ is a linear
combination of $r_0$ and $Ar_0$. Suppose that all vectors $r_0$, $r_1$, ..., $r_i$ are
non-null and that the statement holds for them. Then $\rho_i \ne 0$, since
otherwise the mutually orthogonal vectors $r_0$, ..., $r_{i+1}$ would be
linearly dependent. Since $r_i$ is a linear combination of $r_0$, $Ar_0$, ...,
$A^i r_0$, therefore $Ar_i$ is a linear combination of $Ar_0$, $A^2 r_0$, ..., $A^{i+1} r_0$.
Hence $r_{i+1}$ is expressed as a linear combination of $r_0$, $Ar_0$, ..., $A^{i+1} r_0$.

The theorem in question states that

$\rho_{i+1,0} = \rho_{i+1,1} = \cdots = \rho_{i+1,i-2} = 0.$

For suppose the resolution made. Then since $\rho_i \ne 0$, each $Ar_i$ is a
linear combination of $r_0$, $r_1$, ..., $r_{i+1}$. Hence $Ar_i$ is orthogonal to
every $r_j$ for $j > i + 1$. Hence, $A$ being symmetric,

$r_j^{\mathsf T} Ar_i = 0, \quad |j - i| > 1.$

Hence $Ar_i$ is a linear combination of only $r_{i-1}$, $r_i$, and $r_{i+1}$, which proves
the theorem.

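After simplifying the notation, we can therefore set

(2.22.7) $r_{i+1} = \gamma_i (r_i - \beta_{i-1} r_{i-1} - \alpha_i Ar_i).$

We would like to arrange it so that these vectors $r_i$ are residuals $y - Ax_i$.
Then we require that

$y - Ax_{i+1} = \gamma_i [(y - Ax_i) - \beta_{i-1} (y - Ax_{i-1}) - \alpha_i Ar_i].$

If we impose the condition that

(2.22.8) $\gamma_i (1 - \beta_{i-1}) = 1,$

then we can achieve this by taking

(2.22.9) $x_{i+1} = \gamma_i (x_i - \beta_{i-1} x_{i-1} + \alpha_i r_i).$

For $i = 0$ we have

(2.22.10) $r_1 = r_0 - \alpha_0 Ar_0,$

$x_1 = x_0 + \alpha_0 r_0.$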

Let 

(2.22.11) Pi = rjr<, < 

When we apply the orthogonality criterion, we find first 

(2.22.12) 

Also 



so that by reducing subscripts 
(2.22.13) 
Hence, since 

therefore 

(2.22.14) - 

Hence, beginning with $\alpha_0$ and $\gamma_0 = 1$, we can find $r_1$, then $\alpha_1$, $\beta_0$, $\gamma_1$, and
hence $r_2$, and so on sequentially.
From (2.22.7) we can write 



Therefore 

(2.22.15) 
and hence 

(2.22.16) $r_{i+1} = r_i - \lambda_i Az_i,$

(2.22.17) $z_{i+1} = r_{i+1} + \mu_i z_i,$



where 

(2.22.18) X = T, /*< 
It can be shown inductively that

(2.22.19) $z_i^{\mathsf T} Az_j = 0, \quad i \ne j.$

To begin with, from

$z_0 = r_0,$



we have 

and 

r\r\ = 

Hence, elimination of r\Az Q gives with (2.22.11) 

z\Az\ = pi/Xo 



and this is seen to vanish from (2.22.18), (2.22.14), and (2.22.12). Now 
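suppose (2.22.19) verified for all $j < i \le k$. From (2.22.16) we have

$r_j^{\mathsf T} Az_i = 0, \quad j \le i - 1, \; j > i + 1,$

(2.22.20) $r_i^{\mathsf T} Az_i = \rho_i/\lambda_i, \quad r_{i+1}^{\mathsf T} Az_i = -\rho_{i+1}/\lambda_i.$

Hence, from (2.22.17) with $i = k$ we have the required relation verified
when $i = k + 1$ and $j \le k - 1$.
Again, from (2.22.17) with $i = k$

$z_k^{\mathsf T} Az_{k+1} = z_k^{\mathsf T} Ar_{k+1} + \mu_k z_k^{\mathsf T} Az_k.$

But from (2.22.17) with $i = k - 1$ and from (2.22.20)

(2.22.21) $z_k^{\mathsf T} Az_k = z_k^{\mathsf T} Ar_k = \rho_k/\lambda_k,$

whence, again with (2.22.20),

$z_k^{\mathsf T} Az_{k+1} = -\rho_{k+1}/\lambda_k + \mu_k \rho_k/\lambda_k = 0.$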



This completes the proof of (2.22.19). 
If we set 

(2.22.22) n = 

then from (2.22.16) and (2.22.17) we can calculate sequentially $z_0$,
$\lambda_0$, $r_1$, $\mu_0$, $z_1$, $\lambda_1$, $r_2$, $\mu_1$, $z_2$, ... from the formulas



(2.22.23) Xi 

These equations are obtained by making use of (2.22.20) and (2.22.21). 
From these it is clear that

$\lambda_i > 0, \quad \mu_i > 0.$

Hence, from (2.22.18) $\gamma_i > 0$ and $\beta_i > 0$. Hence

$1 > \beta_i > 0.$

We can now relate the method to the minimizing problem. The
ellipsoid

$f(x) = f(x_0)$

passes through the point $x_0$; as $\lambda$ varies, the points $x_0 + \lambda u$ lie on the
secant through $x_0$ in the direction $u$, and this secant intersects the ellipsoid
for $\lambda = 0$ and again for

$\lambda = \lambda' = 2u^{\mathsf T} r_0 / u^{\mathsf T} Au.$

To see this we have only to solve the equation $f(x_0 + \lambda u) = f(x_0)$ for $\lambda$.
If we take $u = r_0$, then $\lambda' = 2\lambda_0$. Hence $x_1$ is the mid-point of the chord
in the direction $r_0$.

Now the plane $r_0^{\mathsf T} A(x - x_1) = 0$ passes through $x_1$ and also through
the point $x = A^{-1} y$, for by direct substitution the left member of this
equation becomes $r_0^{\mathsf T} r_1$, which vanishes because of orthogonality. This
plane is a diametral plane of the ellipsoid $f(x) = f(x_1)$; it intersects this
latter ellipsoid in an ellipsoid of lower dimensionality. Instead of
choosing $x_2$ to lie on the gradient to $f(x) = f(x_1)$, as is done by the method of
steepest descent, the method of Hestenes and Stiefel now takes $x_2$ to lie
on the orthogonal projection of the gradient in this hyperplane, or, what
amounts to the same, along the gradient to the section of the ellipsoid
which lies in the hyperplane. At the next step a diametral space of
dimension $n - 2$ is formed, and $x_3$ is taken in the gradient to the section
of the ellipsoid $f(x) = f(x_2)$ by this $(n - 2)$-space. Ultimately a
diametral line is obtained, and $x_n$ is the center itself. With the formulas
already given these statements can be proved in detail, but the proof will
be omitted here.
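
The sequential calculation of $z_0$, $\lambda_0$, $r_1$, $\mu_0$, $z_1$, ... can be sketched as below. The recurrences are (2.22.16) and (2.22.17) together with $x_{i+1} = x_i + \lambda_i z_i$; the coefficients are taken here, as an assumption, to be the usual quotients $\lambda_i = r_i^{\mathsf T} r_i / z_i^{\mathsf T} Az_i$ and $\mu_i = r_{i+1}^{\mathsf T} r_{i+1} / r_i^{\mathsf T} r_i$, which agree with (2.22.21). The names `matvec`, `dot`, and `stiefel_hestenes` are hypothetical, and $A$ is assumed symmetric and positive definite.

```python
def matvec(a, x):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    """Scalar product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def stiefel_hestenes(a, y, x0=None, steps=None):
    """Approximate the solution of a x = y for symmetric positive definite a.

    By default n steps are taken; more may be requested to reduce round-off.
    Returns the final approximation and its residual as last updated.
    """
    n = len(y)
    x = [0.0] * n if x0 is None else [float(t) for t in x0]
    r = [yi - ti for yi, ti in zip(y, matvec(a, x))]   # r_0 = y - A x_0
    z = r[:]                                           # z_0 = r_0
    rho = dot(r, r)
    for _ in range(n if steps is None else steps):
        if rho == 0.0:
            break                                      # residual already null
        az = matvec(a, z)
        denom = dot(z, az)
        if denom == 0.0:
            break
        lam = rho / denom
        x = [xi + lam * zi for xi, zi in zip(x, z)]
        r = [ri - lam * azi for ri, azi in zip(r, az)]  # (2.22.16)
        rho_next = dot(r, r)
        mu = rho_next / rho
        z = [ri + mu * zi for ri, zi in zip(r, z)]      # (2.22.17)
        rho = rho_next
    return x, r


if __name__ == "__main__":
    print(stiefel_hestenes([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

For the system of order 2 in the usage example the second step already reaches the center of the ellipses, apart from round-off.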

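2.23. Escalator Methods. Various schemes have been proposed for
utilizing a known solution of a subsystem as a step in solving the complete
system. Let $A$ be partitioned into submatrices,

(2.23.1) $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$

and suppose the inverse $A_{11}^{-1}$ is given or has been previously obtained. If

(2.23.2) $A^{-1} = C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$

then

$AC = \begin{pmatrix} I_{11} & 0_{12} \\ 0_{21} & I_{22} \end{pmatrix},$

where the $I_{ii}$ and the $0_{ij}$ are the identity and null matrices of dimensions
that correspond to the partitioning. Hence if we multiply out, we obtain

(2.23.3)
$A_{11} C_{11} + A_{12} C_{21} = I_{11}, \quad A_{11} C_{12} + A_{12} C_{22} = 0_{12},$
$A_{21} C_{11} + A_{22} C_{21} = 0_{21}, \quad A_{21} C_{12} + A_{22} C_{22} = I_{22}.$

The following solution of the system can be verified:

(2.23.4)
$C_{22} = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1},$
$C_{12} = -A_{11}^{-1} A_{12} C_{22},$
$C_{21} = -C_{22} A_{21} A_{11}^{-1},$
$C_{11} = A_{11}^{-1} - A_{11}^{-1} A_{12} C_{21}.$

If $A_{22}$ is a scalar, $A_{12}$ a column vector, and $A_{21}$ a row vector, then $C_{22}$
is a scalar, $C_{12}$ a column vector, and $C_{21}$ a row vector, and the inverse
required for $C_{22}$ is trivial. The matrices are to be obtained in the order
given, and it is to be noted that the product $A_{11}^{-1} A_{12}$ occurs three times,
and can be calculated at the outset. If $A$ is symmetric, then

$C_{21} = C_{12}^{\mathsf T}.$
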
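In any event the matrix $C_{22}$ is of lower dimension than $C$, and the required
inversion more easily performed. It is therefore feasible to invert in
sequence the matrices

$(\alpha_{11}), \quad \begin{pmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{pmatrix}, \quad \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} \\ \alpha_{21} & \alpha_{22} & \alpha_{23} \\ \alpha_{31} & \alpha_{32} & \alpha_{33} \end{pmatrix}, \quad \ldots,$

each matrix in the sequence taking the place of the $A_{11}$ in the inversion of
the next.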
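
A sketch of this successive bordering, with $A_{22}$ a scalar at every stage, is given below. The function names `border_inverse` and `escalator_inverse` are hypothetical, and every leading submatrix of $A$ is assumed nonsingular so that each scalar $C_{22}$ exists.

```python
def border_inverse(c11, a12, a21, a22):
    """Inverse of the bordered matrix, given c11 = A11^{-1}.

    a12 is the new column, a21 the new row, a22 the new scalar corner.
    """
    k = len(c11)
    p = [sum(c11[i][j] * a12[j] for j in range(k)) for i in range(k)]   # A11^{-1} A12
    q = [sum(a21[i] * c11[i][j] for i in range(k)) for j in range(k)]   # A21 A11^{-1}
    c22 = 1.0 / (a22 - sum(a21[i] * p[i] for i in range(k)))
    c12 = [-pi * c22 for pi in p]
    c21 = [-c22 * qj for qj in q]
    new11 = [[c11[i][j] - p[i] * c21[j] for j in range(k)] for i in range(k)]
    return [new11[i] + [c12[i]] for i in range(k)] + [c21 + [c22]]


def escalator_inverse(a):
    """Invert a by inverting its leading submatrices of order 1, 2, ..., n."""
    n = len(a)
    c = [[1.0 / a[0][0]]]
    for k in range(1, n):
        column = [a[i][k] for i in range(k)]
        row = list(a[k][:k])
        c = border_inverse(c, column, row, a[k][k])
    return c


if __name__ == "__main__":
    print(escalator_inverse([[2, 1], [1, 3]]))
```

The vector `p` is the product $A_{11}^{-1} A_{12}$ calculated once at the outset of each stage and then used three times, as remarked above.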

In the following section it will be shown how from a known inverse
$A^{-1}$ one can find the inverse of a matrix $A'$ which differs from $A$ in only
a single element or in one or more rows and columns. It is clearly
possible to start from any matrix whose inverse is known, say the identity $I$,
and by modifying a row or a column at a time, obtain finally the inverse
required. However, these formulas have importance for their own sake,
and will be considered independently.

2.24. Inverting Modified Matrices. The following formulas can be
verified directly:

(2.24.1) $(A + USV^{\mathsf T})^{-1} = A^{-1} - A^{-1} US(S + SV^{\mathsf T} A^{-1} US)^{-1} SV^{\mathsf T} A^{-1},$

(2.24.2) $(A + US^{-1} V^{\mathsf T})^{-1} = A^{-1} - A^{-1} U(S + V^{\mathsf T} A^{-1} U)^{-1} V^{\mathsf T} A^{-1},$

provided the indicated inverses exist and the dimensions are properly
matched. Thus $A$ and $S$ are square matrices, $U$ and $V$ rectangular. In
particular, if $U$ and $V$ are column vectors $u$ and $v$, and if the scalar $S = 1$,
then

(2.24.3) $(A + uv^{\mathsf T})^{-1} = A^{-1} - (A^{-1} u)(v^{\mathsf T} A^{-1})/(1 + v^{\mathsf T} A^{-1} u).$

If $u = e_i$, then the $i$th row of $uv^{\mathsf T}$ is $v^{\mathsf T}$, and every other row is null;
if $v = e_i$, then the $i$th column of $uv^{\mathsf T}$ is $u$, and every other column is null;
if $u = \sigma e_i$, where $\sigma$ is some scalar, and $v = e_j$, then the element in the
$i$th row and $j$th column of $uv^{\mathsf T}$ is $\sigma$, and every other element is zero. In
the last instance, $v^{\mathsf T} A^{-1} u$ is $\sigma(\alpha^{-1})_{ji}$, where $(\alpha^{-1})_{ji}$ is the element in
row $j$ and column $i$ of $A^{-1}$. We have then the interesting corollary that the
matrix $A + uv^{\mathsf T}$ becomes singular when $\sigma = -1/(\alpha^{-1})_{ji}$.
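
Formula (2.24.3) translates directly into a short routine. The sketch below is illustrative only; the function name `modified_inverse` is hypothetical, and the routine merely reports the singular case of the corollary instead of attempting to treat it.

```python
def modified_inverse(a_inv, u, v):
    """Return the inverse of A + u v^T from a_inv = A^{-1}, as in (2.24.3)."""
    n = len(a_inv)
    au = [sum(a_inv[i][j] * u[j] for j in range(n)) for i in range(n)]   # A^{-1} u
    va = [sum(v[i] * a_inv[i][j] for i in range(n)) for j in range(n)]   # v^T A^{-1}
    denom = 1.0 + sum(v[i] * au[i] for i in range(n))                    # 1 + v^T A^{-1} u
    if denom == 0.0:
        raise ZeroDivisionError("A + u v^T is singular")
    return [[a_inv[i][j] - au[i] * va[j] / denom for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # u = e_1 replaces the first row of I by (1, 0) + v^T.
    print(modified_inverse(identity, [1.0, 0.0], [1.0, 2.0]))
```

In the usage example the modified matrix has rows (2, 2) and (0, 1), and the routine returns its inverse with rows (0.5, -1) and (0, 1). Taking $u = -e_1$ and $v = e_1$ with $A = I$ illustrates the corollary: the denominator vanishes and the modified matrix is singular.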

2.25. Matrices with Complex Elements. If the coefficients of a system
of linear equations are complex, then the matrix can be written in the
form $A + iB$, where $A$ and $B$ have only real elements. In general we
may expect the solution to have complex elements. Hence the equations
can be written in the form

$(A + iB)(x + iy) = c + id,$

where the vectors $x$, $y$, $c$, and $d$ are all real. However, this is equivalent to

$(Ax - By) + i(Ay + Bx) = c + id,$

and since the real parts and the pure imaginary parts must be separately
equal, this is equivalent to the real system of order $2n$:

$Ax - By = c,$
$Bx + Ay = d,$

or

$\begin{pmatrix} A & -B \\ B & A \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} c \\ d \end{pmatrix}.$



Thus the complex system of order n is equivalent to a real system of order 
2n, since these steps can be reversed. The complex matrix A + iB is 
singular if and only if the system with $c + id = 0$ has a nontrivial
solution, and this occurs if and only if the real matrix of order 2n is 
singular. 
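
The reduction to a real system of order $2n$ can be sketched as follows. The names `real_solve` and `solve_complex_via_real` are hypothetical; the real system is solved here by elimination with rows interchanged for size, which is merely one convenient choice and not something prescribed by the text.

```python
def real_solve(m, rhs):
    """Solve a real system by elimination, interchanging rows for size."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(m, rhs)]
    for k in range(n):
        # Bring the largest available element of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        if aug[p][k] == 0.0:
            raise ZeroDivisionError("singular matrix")
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def solve_complex_via_real(a, b, c, d):
    """Solve (A + iB)(x + iy) = c + id; return the real vectors x and y."""
    n = len(a)
    top = [list(a[i]) + [-b[i][j] for j in range(n)] for i in range(n)]   # (A, -B)
    bottom = [list(b[i]) + list(a[i]) for i in range(n)]                  # (B,  A)
    sol = real_solve(top + bottom, list(c) + list(d))
    return sol[:n], sol[n:]


if __name__ == "__main__":
    # (1 + i)(x + iy) = 1 + 3i has the solution 2 + i.
    print(solve_complex_via_real([[1.0]], [[1.0]], [1.0], [3.0]))
```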

A complex matrix is called Hermitian in case $A$ is symmetric and $B$
skew-symmetric, i.e., in case

$A^{\mathsf T} = A, \quad B^{\mathsf T} = -B.$

But then the real matrix of order $2n$ can be written

$\begin{pmatrix} A & B^{\mathsf T} \\ B & A \end{pmatrix},$

and it is symmetric. Hence the complex matrix is Hermitian if and only
if the corresponding real matrix is symmetric. A Hermitian matrix is
positive definite if and only if for every non-null complex vector $x + iy$
it is true that

$(x - iy)^{\mathsf T} (A + iB)(x + iy) > 0.$

This implies that the quantity is real. But if we evaluate the quantity on
the left, we obtain

$x^{\mathsf T}(Ax - By) + y^{\mathsf T}(Bx + Ay) + i[x^{\mathsf T}(Bx + Ay) - y^{\mathsf T}(Ax - By)].$

Since $A$ is symmetric and $B$ skew-symmetric, the quantity within brackets
vanishes, and the quantity in question is certainly real whenever the
matrix is Hermitian. As for the rest, we have

$x^{\mathsf T}(Ax - By) + y^{\mathsf T}(Bx + Ay) = (x^{\mathsf T}, y^{\mathsf T}) \begin{pmatrix} A & -B \\ B & A \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$



If this is positive for every choice of x and y, then the real matrix of order 
2n is positive definite. Hence a Hermitian matrix of order n is positive 
definite if and only if the corresponding real matrix of order 2n is positive 
definite. 

Throughout the discussion of methods of inverting matrices and solving 
systems of equations, we have tacitly assumed that all quantities were
real, though in fact many of the processes were equally applicable to the
complex case when appropriate changes are made in the wording. How- 
ever, rather than complicate the exposition, we have preferred to treat 
only the real case and reduce the complex case to the real. 

2.3. Some Comparative Evaluations. For either inverting a matrix 
or for solving a system of equations, there is no single method that is 
clearly best for all matrices or all systems. Some matrices may have 
many null elements, especially those which result from the finite difference 
approximation to a differential equation. Analysis of the system may 
show that a method that would be highly inefficient in general would work 
admirably well for the particular case. On the other hand, if one has 
many systems to solve, all differing among themselves, it may be more 
efficient to use the same scheme for all of them than it would be to analyze 
each system as it arises before deciding upon how to proceed. This is 
especially true if one is using automatic computing machinery for which 
the arrangement of the program is a major task. 

When computing machinery is used, the method must be adapted to 
the machine. Generally speaking, the number of multiplications and 
divisions and the number of recordings of intermediate results together 
provide a rough over-all estimate of the efficiency of a computational 
scheme. It is possible to estimate these numbers as functions of n, the 
order of the system, but the functions may be discontinuous. That is to 
say, if n is small enough so that all quantities, initial, intermediate, and 
final, can be retained in the internal memory of the machine, the func- 
tions will be of one form. But if n is so large that the auxiliary storage 
must be utilized, then transfers must be made between the internal 
memory and the external, and additional operations may be required. 
In fact, it may be necessary to use an entirely different computational 
scheme. 

2.31. Operational Counts. The possible occurrence of null elements 
will be ignored. With this understanding we consider the number of 
operations required in the application of some of the methods discussed
above.

2.311. The method of Seidel and the method of relaxation. The
equations can be written in scalar form

$\xi_i = \eta_i/\alpha_{ii} - \sum_{j \ne i} (\alpha_{ij}/\alpha_{ii}) \xi_j, \quad i = 1, 2, \ldots, n.$

The $n(n + 1)/2$ divisions $\eta_i/\alpha_{ii}$ and $\alpha_{ij}/\alpha_{ii}$ (for a symmetric matrix)
can always be done in advance. Thereafter each correction of a single
$\xi_i$ requires $n - 1$ multiplications and a single recording provided the
products can be accumulated. For a complete Seidel cycle this is
$n(n - 1)$ products and $n$ recordings. The number of cycles required,
however, depends upon the system, the starting values, and the required
accuracy.
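
The count for a single cycle can be checked with a small sketch that prepares the quotients in advance and then tallies the multiplications. The names `prepare` and `seidel_cycle` are hypothetical, and the diagonal elements are assumed non-null.

```python
def prepare(a, y):
    """Form the quotients alpha_ij/alpha_ii and eta_i/alpha_ii in advance."""
    n = len(a)
    coef = [[a[i][j] / a[i][i] if j != i else 0.0 for j in range(n)] for i in range(n)]
    const = [y[i] / a[i][i] for i in range(n)]
    return coef, const


def seidel_cycle(coef, const, x):
    """Correct every element of x once, in place; return the number of products."""
    n = len(x)
    products = 0
    for i in range(n):
        s = const[i]
        for j in range(n):
            if j != i:
                s -= coef[i][j] * x[j]   # accumulated product
                products += 1
        x[i] = s                          # single recording
    return products


if __name__ == "__main__":
    coef, const = prepare([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    x = [0.0, 0.0]
    for _ in range(60):
        count = seidel_cycle(coef, const, x)
    print(x, count)
```

Each call forms $n(n - 1)$ products and makes $n$ recordings; how many calls are needed depends, as stated, upon the system and the starting values.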

2.312. The method of steepest descent. One requires at each step the
product $Ax_{p-1}$, or $n^2$ products, the $n^2$ products $Ar_p$, the $n$ products
$r_p^{\mathsf T}(Ar_p)$, the $n$ products $r_p^{\mathsf T} r_p$, the quotient $r_p^{\mathsf T} r_p / r_p^{\mathsf T} Ar_p = \lambda_p$, and the $n$
products $\lambda_p r_p$, as well as the various sums and differences. Counting
multiplications and divisions as equivalent, this is $2n^2 + 3n + 1$ product
operations at each step. This is somewhat more than two complete
Seidel cycles performed on a prepared system in which the indicated
quotients have been taken in advance. As for recordings, one requires
at least the $n$ elements $r_p$, the $n$ elements $Ar_p$, the products $r_p^{\mathsf T} r_p$ and
$r_p^{\mathsf T} Ar_p$, the quotient $\lambda_p$, and the $n$ elements $x_p$, or altogether $3n + 3$
quantities, as compared with $n$ for a complete Seidel cycle.

2.313. The Stiefel-Hestenes method. The general step requires $n^2$
products for $Ax_i$ and $n^2$ for $Ar_i$, $n$ for $r_i^{\mathsf T} r_i$ and $n$ more for $r_i^{\mathsf T} Ar_i$, and
individual products and quotients for $\alpha$ and $\beta$. In principle the iteration
terminates in $n$ steps with something over $2n^3$ product operations. Each
additional projection for reducing round-off requires something over
$2n^2$ product operations. In recordings, each step requires at least the
$n$ elements of $x_i$, the $n$ elements of $r_i$, the $n$ elements of $Ar_i$, and a few
scalars. This is something over $3n$ recordings per step or $3n^2$ for a
complete cycle. In addition $r_{i-1}$ must be carried over from the previous
step.

2.314. The Crout factorization. One is to write

$(A, y) = L(W, z),$

where $L$ is unit lower triangular and $W$ upper triangular, then solve the
triangular system

$Wx = z.$

Consider Eqs. (2.201.6) supposing $W_{11}$ and $L_{11}$ to be of order $i$, $W_{22}$ a
scalar, and $L_{22} = 1$. Then $W_{22}$ requires $i$ products, as does each element
in $W_{23}$, making $i(n - i)$ products. Each of the $n - i - 1$ elements of
$L_{32}$ requires $i$ products in $L_{31} W_{12}$ and the quotient of the result by $W_{22}$,
making $(i + 1)(n - i - 1)$ product operations. Finally, from Eq.
(2.21.6) one requires $i$ products for $z_2$. This is $n(2i + 1) - 2i^2 - i - 1$
product operations in all. Summing from $i = 0$ to $i = n - 1$, we get a
total of $n(n^2 - 1)/3$ products to be formed. Solving the triangular
system requires a quotient for $\xi_n$, a product and a quotient for $\xi_{n-1}$, ...,
or altogether $n(n + 1)/2$ product operations. Altogether we have a
total of $n(2n + 1)(n + 1)/6$ product operations, or something over $n^3/3$.
The recordings required are the triangular matrices $L$ and $W$ and the
vectors $z$ and $x$, or $n^2 + 2n$ quantities altogether.

To iterate the process in order to reduce round-off, formation of
$r_0 = y - Ax_0$ requires $n^2$ products; $L^{-1} r_0$ requires $n(n - 1)/2$ (all
multiplications, no divisions since $L$ is unit lower triangular); and
$W^{-1} L^{-1} r_0$ requires $n(n + 1)/2$, or altogether $2n^2$ products and at least
$3n$ recordings. These give the corrections to $x_0$, so that an additional $n$
recordings of $x_1$ itself are required. If the matrix is symmetric, the
operations are reduced by nearly one-half.

2.315. Orthogonalization. In forming $RV = A$, $R^{\mathsf T} R = D^2$ as in §2.22,
suppose $i$ columns of $A$ have been orthogonalized. As in (2.201.9)
and (2.201.10), $i$ elements of the next column of $V$ are to be found, each
requiring $n$ products and a division. Then the next column of $R$ requires
$n(n + i)$ products, and $n$ more are required for the next element of
$D^2$. Hence to orthogonalize the columns of $A$ requires a total of

$n(4n^2 + n - 1)/2,$

or approximately $2n^3$ products. Beyond this one requires $R^{\mathsf T} y$ with $n^2$
products; $n$ more products in multiplying this by $D^{-2}$; and $n(n - 1)/2$
more in solving the triangular system $Vx = D^{-2} R^{\mathsf T} y$. Altogether it
amounts to $2n^2(n + 1)$ products. For recordings we require at least the
$n^2$ elements of $R$; $n(n - 1)/2$ elements of $V$; $n$ elements of $D^2$; and $n$
elements each of $R^{\mathsf T} y$, of $D^{-2} R^{\mathsf T} y$, and of $x$. This makes $n(3n + 7)/2$
recordings.

2.316. Inverting a modified matrix. In Eq. (2.24.3), if $u = e_i$, and $A^{-1}$
is given, then the inversion of $A + uv^{\mathsf T}$ requires $n^2$ multiplications for
$v^{\mathsf T} A^{-1}$; $n$ quotients of $1 + v^{\mathsf T} A^{-1} u$ into the vector $v^{\mathsf T} A^{-1}$ (or into $A^{-1} u$);
$n^2$ products for multiplying the column vector by the row vector. Hence
there are $2n^2 + n$ product operations for modifying the inverse when a
single column of the matrix $A$ is modified. If one builds up the inverse
by modifying a column at a time, then in the worst case $n^2(2n + 1)$
products are required. However, if one starts with the identity, then
in the first step, since

$(I + e_i v^{\mathsf T})^{-1} = I - e_i v^{\mathsf T}/(1 + v^{\mathsf T} e_i),$

only $n$ quotients are needed and no other products. The new inverse
differs from $I$ in only the $i$th row, so that many zeros remain if $n$ is large.
If the programing takes advantage of the presence of the zeros, the
number of products is reduced considerably.

Once the inverse is taken, if a set of equations are to be solved, an
additional $n^2$ products are needed.

2.4. Bibliographic Notes. Most of the methods described here are 
old, and have been independently discovered several times. A series of
papers by Bodewig (1947, 1947-1948) compares various methods, includ-
ing some not described here, with operational counts and other points of 
comparisons, and it contains a list of sources. Forsythe (1952) has 
compiled an extensive bibliography with a classification of methods. A 
set of mimeographed notes (anonymous) was distributed by the Institute 
for Numerical Analysis, and interpreted several of the standard iterative 
methods as methods of successive projection, much as is done here. This 
includes the iterative resolution of y along the columns of A, which is 
attributed to C. B. Tompkins. The same method had been given by 
A. de la Garza in a personal communication to the author in 1949. 

Large systems of linear inequalities have important military and 
economic applications. Agmon (1951) has developed a method of 
relaxation for such systems, which reduces to the ordinary method of 
relaxation when the inequalities become equations. 

Conditions for convergence of iterations are given by von Mises and 
Pollaczek-Geiringer (1929), Stein (1951b, 1952), Collatz (1950), Reich
(1949), Ivanov (1939), and Plunkett (1950). 

On the orthogonalizations of residuals see Lanczos (1950, 1951). For 
other discussions of the method of Lanczos, Stiefel, and Hestenes see 
Hestenes and Stein (1951), Stiefel (1952), Hestenes and Stiefel (1952). 

Crout (1941) pointed out the possibility of economizing on the record-
ings in the triangular factorization. The author is indebted to James 
Alexander and Jean Hall for pointing out the possibility of a similar 
economy in recording in the use of Jordan's method. Turing (1948) 
discusses round-off primarily with reference to these two methods and 
refers to the former as the "unsymmetric Choleski method." The 
formulas apply to the assessment of an inverse already obtained. On 
the other hand, von Neumann and Goldstine (1947) obtain a priori 
estimates, but in terms of largest and smallest proper values. They 
assume the method of elimination to be applied to the system with positive
definite matrix A or to the system which has been premultiplied by
$A^T$ to make the matrix positive definite. See also Bargmann, Montgomery,
and von Neumann (1946).

Dwyer (1951) devotes some little space to a discussion of errors and 
gives detailed computational layouts. Hotelling (1943) deals with a 
variety of topics, including errors and techniques. Lonseth (1947) 
gives the essential formulas for propagated error. 

Sherman and Morrison (1949, 1950), Woodbury (1950), and Bartlett 
(1951) give formulas for the inverse of a modified matrix, and Sherman 
applies these to inversion in general. 

A number of detailed techniques appear in current publications by the 
International Business Machines Corporation, especially in the reports 
of the several Endicott symposia.

In the older literature, reference should be made especially to Aitken
(1932b, 1936-1937a).

An interesting and valuable discussion of measures of magnitude is
given by Fadeeva (1950, in Benster's translation). In particular Fadeeva
suggests the association of the measures designated here as c(A)
and b(x).

The use of Chebyshev polynomials for accelerating convergence is 
described by Grossman (1950) and Gavurin (1950). 

On general theory the literature is abundant. Muir (1906, 1911, 1920,
1923) is almost inexhaustible on special identities and special forms, and
many of the results are summarized in Muir and Metzler (1930). Frazer,
Duncan, and Collar (1946) emphasize computational methods. MacDuffee
(1943) is especially good on the normal forms and the characteristic
equation.



CHAPTER 3 
NONLINEAR EQUATIONS AND SYSTEMS 



3. Nonlinear Equations and Systems 

In the present chapter matrices and vectors will occur only incidentally. 
Consequently the convention followed in the last chapter of representing 
scalars only by Greek letters will be dropped here. The objective of this 
chapter is to develop methods for the numerical approximation to the 
solutions of nonlinear equations and systems of equations. With systems 
of nonlinear equations, the procedure is generally to obtain a sequence
of systems of linear equations whose solutions converge to the required 
values, or else a sequence of equations in a single unknown. 

A major objective in the classical theory of equations is the expression 
in closed form of the solutions of an equation and the determination of 
conditions under which such expressions exist. Aside from the fact 
that only a limited class of equations satisfy these conditions, the closed 
expressions themselves are generally quite unmanageable computationally.
Thus it is easy to write the formula for the real solution of
$x^3 = 9$, but if one needs the solution numerically to very many decimals,
it is most easily obtained by solving the equation numerically by one 
of the methods which will be described. Nevertheless, certain principles 
from the theory of equations will be required, and we begin by develop- 
ing them. 

To begin with, we shall be concerned with an algebraic equation, which 
is one that can be written 

(3.0.1) P(x) = 0, 

where 

(3.0.2)    $P(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n$,

a polynomial of degree n. Ordinarily we shall suppose that $a_0 \neq 0$, since
otherwise the polynomial or the equation would be of some lower degree.
This being the case, we can always, if we wish, write

(3.0.3)    $p(x) \equiv a_0^{-1} P(x) = x^n + (a_1/a_0) x^{n-1} + \cdots + a_n/a_0$,

and the equation

(3.0.4)    p(x) = 0

is equivalent to the original one.

3.01. The Remainder Theorem; the Symmetric Functions. A basic
theorem is the remainder theorem, which states that, if the polynomial
P(x) is divided by x - r, where r is any number, then the remainder is
the number P(r). Thus suppose

(3.01.1)    $P(x) \equiv (x - r)Q(x) + R$,

where Q(x) is the quotient of degree n - 1, and R is the constant remainder.
This is an identity, valid for all values of x. Hence in particular
it is valid for x = r, which leads to

(3.01.2)    P(r) = R.

A corollary is the factor theorem, which states that, if r is a zero of the
polynomial P(x), that is, a root of Eq. (3.0.1), then x - r divides P(x)
exactly. Conversely, if x - r divides P(x), then r is a zero of P(x).
For by (3.01.2) if r is a zero of P(x), then R = 0, and by (3.01.1) the
division by x - r is exact. The converse is obvious.

The fundamental theorem of algebra states that every algebraic equa- 
tion has a root. The proof is a bit long and will not be given here. But 
it follows from that and the factor theorem that an algebraic equation 
of degree n has exactly n roots (which, however, are not necessarily 
distinct). For by the fundamental theorem (3.0.1) has a root, which
we may call $x_1$. By the factor theorem we can write

            $P(x) \equiv (x - x_1) Q_1(x)$.

But $Q_1 = 0$ is an algebraic equation of degree n - 1; it has a root, say
$x_2$, and hence

            $Q_1(x) \equiv (x - x_2) Q_2(x)$.

Eventually we get

            $Q_{n-1}(x) \equiv (x - x_n) Q_n$,

where $Q_{n-1}$ is linear and $Q_n$ a constant. But then

(3.01.3)    $P(x) \equiv (x - x_1)(x - x_2) \cdots (x - x_n) Q_n$,

and not only $x_1$ but also $x_2, \ldots, x_n$ are roots of P = 0. But there can
be no others. For if $x_{n+1}$ were different from $x_1, \ldots, x_n$, and also a
root of P = 0, then it would be true that

            $0 = P(x_{n+1}) = (x_{n+1} - x_1) \cdots (x_{n+1} - x_n) Q_n$.

But if $x_{n+1} - x_i \neq 0$ for i = 1, . . . , n, then $Q_n = 0$ and P = 0
identically. Hence there are exactly n roots, and the theorem is proved. As
a partial restatement we can say that, if a polynomial of degree not
greater than n is known to vanish for n + 1 distinct values of x, then this
polynomial vanishes identically.

Now consider the factorization (3.01.3). If we multiply out on the 
right, the polynomial we obtain must be precisely the polynomial P. 
But the coefficient of $x^n$ on the right is $Q_n$, so therefore $Q_n = a_0$. By
examining the other coefficients in turn, we find

            $-a_1 = a_0 \sum_i x_i$,
            $a_2 = a_0 \sum_{i<j} x_i x_j$,
(3.01.4)    $-a_3 = a_0 \sum_{i<j<k} x_i x_j x_k$,
            . . . . . . . . . .
            $(-1)^n a_n = a_0 x_1 x_2 \cdots x_n$.

Thus $(-1)^h a_h / a_0$ is the sum of the $\binom{n}{h}$ products of the x's taken h at a
time. These sums are called the elementary symmetric functions of the
roots. They are symmetric because interchanging any pair of the roots 
leaves the value of the function unchanged. It is a theorem that any 
rational symmetric function is expressible as a rational function of the 
elementary symmetric functions. The general theorem will not be 
required here, but special cases will appear. Consequently we introduce
the notation $\sigma_h$ for the elementary symmetric function of degree h:

(3.01.5)    $\sigma_h = (-1)^h a_h / a_0$.

Of particular importance will be the sums of powers:

(3.01.6)    $s_h = \sum_i x_i^h$,

where h is any integer, positive, negative, or zero. For h = 0 we have
$s_0 = n$. Expressions for these in terms of the elementary symmetric
functions will be given later.
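
As a small numerical illustration of (3.01.4) to (3.01.6), the following Python sketch forms the elementary symmetric functions directly from a list of roots and rebuilds the coefficients from them. The function names are our own and merely illustrative; the direct enumeration of products is meant for checking small cases, not for efficient computation.

```python
from itertools import combinations
from math import prod


def elementary_symmetric(roots, h):
    """sigma_h: the sum of all products of the roots taken h at a time."""
    return sum(prod(c) for c in combinations(roots, h))


def coefficients_from_roots(a0, roots):
    """Coefficients a_0, ..., a_n, using a_h = (-1)^h a_0 sigma_h."""
    n = len(roots)
    return [(-1) ** h * a0 * elementary_symmetric(roots, h) for h in range(n + 1)]


def power_sum(roots, h):
    """s_h: the sum of the h-th powers of the roots."""
    return sum(x ** h for x in roots)
```

For the roots 1, 2, 3 and $a_0 = 2$ this reproduces the coefficients of 2(x - 1)(x - 2)(x - 3), and `power_sum(roots, 0)` returns n as stated above.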

We conclude this section by noting that

(3.01.7)    $x^n P(1/x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$
                        $= a_0 (1 - x x_1)(1 - x x_2) \cdots (1 - x x_n)$.

Hence the equation

(3.01.8)    $a_n x^n + \cdots + a_0 = 0$

has the n roots $x_1^{-1}, \ldots, x_n^{-1}$, provided every $x_i \neq 0$. We call it the
reciprocal equation. Then

(3.01.9)    $a_{n-1} = -a_n \sum_i x_i^{-1}$,
            $a_{n-2} = a_n \sum_{i<j} x_i^{-1} x_j^{-1}$,
            . . . . . . . . . .



3.02. The Derivative Equations. The derivative equations are those
formed from the derivatives of P:

            P'(x) = 0,
(3.02.1)    P''(x) = 0,
            . . . . . .



If P is a real polynomial, i.e., if all its coefficients are real, then the real 
roots of the derivative equations have important relations to the real 
roots of the original. 

If we set x = z + r, where r is any real number, forming P(z + r),
expand each power of z + r, and collect like powers of z, the result is a
polynomial in z of degree n. If in this polynomial we now replace z by
x - r but without expanding powers of x - r, we obtain an expression
of the form

            $P(x) \equiv c_n + c_{n-1}(x - r) + c_{n-2}(x - r)^2 + \cdots + c_0 (x - r)^n$,

where the c's are the constant coefficients of the several powers of z in
P(z + r), and in particular, $c_0 = a_0$. This is an identity which therefore
holds when we differentiate on both sides and continues to hold when we
give to x any fixed value. In particular if we set x = r, we find (as in the
proof of the remainder theorem) that

            $P(r) = c_n$,

and if we first differentiate i times and then set x = r,

            $P^{(i)}(r) = i!\, c_{n-i}$.

Hence

(3.02.2)    $P(x) = P(r) + (x - r)P'(r) + (x - r)^2 P''(r)/2! + \cdots$
                        $+ (x - r)^n P^{(n)}(r)/n!$.

This is Taylor's series for polynomials.

If P(r) = 0 but $P'(r) \neq 0$, then x - r is a factor of P(x), but $(x - r)^2$
is not. Hence r is a simple root of P = 0. But if P(r) = P'(r) = 0,
then r is at least a double root. Hence a root of P = 0 is a multiple root
if and only if it is also a root of P' = 0. In fact, r is a root of multiplicity
m if and only if

            $P(r) = P'(r) = \cdots = P^{(m-1)}(r) = 0, \quad P^{(m)}(r) \neq 0$.

In that case (3.02.2) becomes

            $P(x) \equiv (x - r)^m [P^{(m)}(r)/m! + \cdots + (x - r)^{n-m} P^{(n)}(r)/n!]$.

Moreover, r is a root of multiplicity m - 1 of P' = 0, of multiplicity
m - 2 of P'' = 0, . . . .

If P is a real polynomial, then between consecutive real roots of P = 0
there is an odd number of roots of P' = 0. In particular, there is at least
one. This is Rolle's theorem. For suppose $x_1$ is a root of multiplicity
$m_1$, $x_2$ of multiplicity $m_2$. Then we can write

            $P(x) \equiv (x - x_1)^{m_1} (x - x_2)^{m_2} Q(x)$,

where Q does not vanish at $x_1$ or $x_2$ or anywhere between. Since Q is a
polynomial, it must retain the same sign throughout the interval. Now

            $P'(x) \equiv (x - x_1)^{m_1 - 1} (x - x_2)^{m_2 - 1} q(x)$,

where

            $q(x) = m_1 (x - x_2) Q + m_2 (x - x_1) Q + (x - x_1)(x - x_2) Q'$.

Hence

            $q(x_1) = m_1 (x_1 - x_2) Q(x_1), \quad q(x_2) = m_2 (x_2 - x_1) Q(x_2)$.

But $m_1 Q(x_1)$ and $m_2 Q(x_2)$ have the same sign, whereas

            $x_1 - x_2 = -(x_2 - x_1)$.

Hence $q(x_1)$ and $q(x_2)$ have opposite signs, and q(x) must vanish an odd
number of times between $x_1$ and $x_2$. Hence the same is true of P'(x).

If we differentiate the factored form (3.01.3) of P(x), we obtain for P'
a sum of products of n - 1 factors each. In fact, each product can be
written as $P(x)/(x - x_i)$ for some i. Hence

(3.02.3)    $P'(x) \equiv P(x) \sum_i (x - x_i)^{-1}$,

or

(3.02.4)    $P'(x)/P(x) = \sum_i (x - x_i)^{-1} = p'(x)/p(x)$.

But for x sufficiently large

            $(x - x_i)^{-1} = x^{-1} + x_i x^{-2} + x_i^2 x^{-3} + \cdots$

and

            $\sum_i (x - x_i)^{-1} = n x^{-1} + s_1 x^{-2} + s_2 x^{-3} + \cdots$.

Since

            $p(x) = x^n - \sigma_1 x^{n-1} + \sigma_2 x^{n-2} - \cdots$,

we have

            $p'(x) = n x^{n-1} - (n - 1)\sigma_1 x^{n-2} + (n - 2)\sigma_2 x^{n-3} - \cdots$.

Hence if we multiply p(x) by $\sum_i (x - x_i)^{-1}$ and equate the coefficients to
those of p'(x), we get the relations

            $s_1 - \sigma_1 = 0$,
            $s_2 - s_1\sigma_1 + 2\sigma_2 = 0$,
            $s_3 - s_2\sigma_1 + s_1\sigma_2 - 3\sigma_3 = 0$,
(3.02.5)    . . . . . . . . . . . . . . . .
            $s_n - s_{n-1}\sigma_1 + \cdots + (-1)^n n\sigma_n = 0$,
            $s_{n+p} - s_{n+p-1}\sigma_1 + \cdots + (-1)^n s_p \sigma_n = 0$.



These are Newton's identities, expressing recursively the sums of powers 
of the roots as polynomials in the elementary symmetric functions, and 
hence as rational functions of the coefficients. If one applies the same 
relations to the reciprocal equation (3.01.8), one obtains the sums of the 
powers with negative exponents. 
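
Newton's identities lend themselves to a short recursive computation. The sketch below, in Python with exact rational arithmetic, obtains $s_1, s_2, \ldots$ from the coefficients alone, using $\sigma_h = (-1)^h a_h / a_0$ and $s_0 = n$ together with the recurrences $s_k = s_{k-1}\sigma_1 - s_{k-2}\sigma_2 + \cdots$ in which the term in $\sigma_k$ carries the factor k when k does not exceed n. The function name `power_sums` is an illustrative assumption, not part of the text.

```python
from fractions import Fraction


def power_sums(coeffs, count):
    """Return [s_1, ..., s_count] for the roots of a_0 x^n + ... + a_n = 0."""
    a0 = Fraction(coeffs[0])
    n = len(coeffs) - 1
    # sigma[h] = (-1)^h a_h / a_0, with sigma[0] = 1
    sigma = [(-1) ** h * Fraction(coeffs[h]) / a0 for h in range(n + 1)]
    s = [Fraction(n)]  # s_0 = n
    for k in range(1, count + 1):
        total = Fraction(0)
        for j in range(1, min(k, n) + 1):
            # the sigma_k term is weighted by k; the others by s_{k-j}
            weight = k if j == k else s[k - j]
            total += (-1) ** (j - 1) * sigma[j] * weight
        s.append(total)
    return s[1:]
```

For $x^3 - 6x^2 + 11x - 6$, whose roots are 1, 2, 3, the first four power sums come out as 6, 14, 36, 98, in agreement with direct summation.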
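If we set

(3.02.6)    $Q(z) \equiv \prod_i (1 - x_i z) \equiv 1 - \sigma_1 z + \sigma_2 z^2 - \cdots + (-1)^n \sigma_n z^n$,

and expand

(3.02.7)    $1/Q(z) \equiv \prod_i (1 - x_i z)^{-1} \equiv 1 + S_1 z + S_2 z^2 + \cdots$,

the coefficients $S_p$ of this expansion are symmetric functions of the roots:

            $S_1 = x_1 + x_2 + \cdots + x_n$,
(3.02.8)    $S_2 = x_1^2 + x_1 x_2 + \cdots + x_n^2$,
            . . . . . . . . . .

the so-called "complete" symmetric functions. Since

            $(1 - \sigma_1 z + \sigma_2 z^2 - \cdots)(1 + S_1 z + S_2 z^2 + \cdots) \equiv 1$,

on comparing coefficients, one obtains

            $S_1 - \sigma_1 = 0$,
(3.02.9)    $S_2 - S_1\sigma_1 + \sigma_2 = 0$,
            $S_3 - S_2\sigma_1 + S_1\sigma_2 - \sigma_3 = 0$,
            . . . . . . . . . .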



Of the three sets of symmetric functions, each set can be expressed in 
terms of the others by means of these equations. 
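
The passage from the elementary to the complete symmetric functions can be sketched in the same way. The Python fragment below assumes the recurrence $S_p = S_{p-1}\sigma_1 - S_{p-2}\sigma_2 + \cdots$ with $S_0 = 1$, which is what comparison of coefficients in $Q(z) \cdot [1/Q(z)] \equiv 1$ yields; the name `complete_symmetric` is ours.

```python
def complete_symmetric(sigma, count):
    """Given sigma = [sigma_1, ..., sigma_n], return [S_1, ..., S_count]."""
    S = [1]  # S_0 = 1
    for p in range(1, count + 1):
        total = 0
        for j in range(1, min(p, len(sigma)) + 1):
            total += (-1) ** (j - 1) * sigma[j - 1] * S[p - j]
        S.append(total)
    return S[1:]
```

With the two roots 1 and 2, so that $\sigma_1 = 3$ and $\sigma_2 = 2$, the call `complete_symmetric([3, 2], 3)` gives 3, 7, 15, which are 1 + 2, 1 + 2 + 4, and 1 + 2 + 4 + 8.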

3.03. Vandermonde Determinants. It is easy to write in determinantal 
form an equation having specified roots. To illustrate for the case n = 3, 
if $x_1$, $x_2$, and $x_3$ are all distinct, the equation

(3.03.1)    $\begin{vmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_2 & x_3 & x \\ x_1^2 & x_2^2 & x_3^2 & x^2 \\ x_1^3 & x_2^3 & x_3^3 & x^3 \end{vmatrix} = 0$

has these and only these roots. The determinant, therefore, whose
expansion is a cubic polynomial in x, is equal to some constant times
$(x - x_1)(x - x_2)(x - x_3)$ by application of the factor theorem. If we
were to regard $x_3$ as the variable instead of x, and apply the factor theorem
again, it appears that $(x_3 - x_2)$ and $(x_3 - x_1)$ are also factors of the
expansion of the determinant. Likewise, regarding $x_2$ as the variable,
$(x_2 - x_1)$ appears as an additional factor. Hence the determinant is
equal to the product $(x_3 - x_2)(x_3 - x_1)(x_2 - x_1)(x - x_3)(x - x_2)(x - x_1)$,
possibly multiplied by some factor as yet undetermined. However, the
determinant is a cubic polynomial in x, and so has no other factors
containing x; it is also a cubic in $x_3$, and so can have no other factors
containing $x_3$; nor by the same rule can it have other factors containing $x_2$ or $x_1$.
Hence any factor not yet found is a constant, independent of x or any
of the $x_i$. But the expansion of the determinant contains the term $x_2 x_3^2 x^3$
once from the principal diagonal, and the expansion of the product
contains this product also. Hence there is no other factor, and

            $\begin{vmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_2 & x_3 & x \\ x_1^2 & x_2^2 & x_3^2 & x^2 \\ x_1^3 & x_2^3 & x_3^3 & x^3 \end{vmatrix}$
                $= (x_3 - x_2)(x_3 - x_1)(x_2 - x_1)(x - x_3)(x - x_2)(x - x_1)$.

The coefficient of $x^3$ is

            $\begin{vmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \end{vmatrix}$.

Such a determinant is called an alternant, or an elementary Vandermonde
determinant. The negative of the coefficient of $x^2$ is

            $\begin{vmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^3 & x_2^3 & x_3^3 \end{vmatrix}$.

Again, the coefficient of x is

            $\begin{vmatrix} 1 & 1 & 1 \\ x_1^2 & x_2^2 & x_3^2 \\ x_1^3 & x_2^3 & x_3^3 \end{vmatrix}$,

and the negative of the constant term is

            $\begin{vmatrix} x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \\ x_1^3 & x_2^3 & x_3^3 \end{vmatrix}$.

The equation whose roots are $x_1$, $x_2 = x_1$, and $x_3 \neq x_1$ is

            $\begin{vmatrix} 1 & 0 & 1 & 1 \\ x_1 & 1 & x_3 & x \\ x_1^2 & 2x_1 & x_3^2 & x^2 \\ x_1^3 & 3x_1^2 & x_3^3 & x^3 \end{vmatrix} = 0$.

For if we call the determinant P(x), then $P(x_1) = P'(x_1) = 0$, and
$P(x_3) = 0$. Likewise the equation whose roots are $x_1$, $x_2 = x_1$, and
$x_3 = x_1$ is

            $\begin{vmatrix} 1 & 0 & 0 & 1 \\ x_1 & 1 & 0 & x \\ x_1^2 & 2x_1 & 1 & x^2 \\ x_1^3 & 3x_1^2 & 3x_1 & x^3 \end{vmatrix} = 0$.



These representations are easily generalized, but the notation becomes 
cumbersome. 
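
The factorization of the alternant is easily checked numerically for small cases. The following Python sketch builds the matrix whose columns are $(1, x_j, x_j^2, \ldots)$, evaluates its determinant by cofactor expansion (adequate only for small orders), and compares the result with the product of differences. The helper names are assumptions of this illustration.

```python
from itertools import combinations


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def vandermonde(points):
    """Matrix whose column j is (1, x_j, x_j^2, ...)."""
    n = len(points)
    return [[x ** i for x in points] for i in range(n)]


def difference_product(points):
    """Product of (x_j - x_i) over all pairs i < j."""
    result = 1
    for i, j in combinations(range(len(points)), 2):
        result *= points[j] - points[i]
    return result
```

Taking the last point as the variable x, the determinant vanishes whenever x coincides with one of the other points, as the factor theorem requires.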

3.04. Synthetic Division. We consider now some further useful consequences
of the remainder theorem. First we observe that P(r), which
is the remainder after dividing P(x) by x - r, is most readily evaluated by
evaluating sequentially

            $a_0 r + a_1$,
            $(a_0 r + a_1) r + a_2$,
            $[(a_0 r + a_1) r + a_2] r + a_3$,
            . . . . . . . . . .

with P(r) obtained as the final step. This process can be systematized
by writing the scheme

            $\begin{array}{ccccc|c} a_0 & a_1 & a_2 & \cdots & a_n & r \\ & b_0 r & b_1 r & \cdots & b_{n-1} r & \\ \hline b_0 & b_1 & b_2 & \cdots & R & \end{array}$

where $b_0 = a_0$, and in general every number along the bottom row is the
sum of the two above it. The r is written in the upper right-hand box
merely as a convenient reminder.

Having written this, we now observe that the b's are the coefficients
of the quotient

            $Q(x) = b_0 x^{n-1} + b_1 x^{n-2} + \cdots + b_{n-1}$.

One way to see this is to note that, when in ordinary long division we
divide P(x) by x - r, the $b_1$ is exactly the remainder we get after dividing
$a_0 x + a_1$ by x - r, the $b_2$ is the remainder after dividing $a_0 x^2 + a_1 x + a_2$,
and so on sequentially. We have written above merely a scheme for
evaluating these remainders in sequence.
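
The scheme translates almost word for word into a program. The Python sketch below takes the coefficients $a_0, \ldots, a_n$ in descending order and returns the b's of the quotient together with the remainder; the name `synthetic_division` is our own choice.

```python
def synthetic_division(coeffs, r):
    """Divide a_0 x^n + ... + a_n by (x - r).

    Returns (quotient coefficients b_0..b_{n-1}, remainder R = P(r)).
    """
    b = [coeffs[0]]                  # b_0 = a_0
    for a in coeffs[1:]:
        b.append(a + b[-1] * r)      # each entry is the sum of the two above it
    return b[:-1], b[-1]
```

For $x^3 - 6x^2 + 11x - 6$ and r = 2 the remainder is zero and the quotient is $x^2 - 4x + 3$; for r = 4 the remainder is P(4) = 6.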

If the coefficients of the equation P(x) = 0 are all integers, it is possible
to obtain all its rational roots by inspection and a few synthetic divisions.
This is a help even when the rational roots are of no interest for themselves,
since for every known rational root the degree of the equation
can be lowered by one. If we examine the scheme for synthetic division,
we can see that, if r is an integer and the a's are all integers, then the b's
are all integers. If r is a root, then R = 0, so that $a_n = -b_{n-1} r$. Hence
r is a factor of $a_n$. Thus if the equation has any integral root, the root is
a divisor of the constant term. More generally, if P(x) = 0 is a polynomial
equation with integral coefficients, and if p/q is a rational root in
lowest terms, then p is a divisor of the constant term, and q is a divisor
of the leading coefficient.

For suppose r is a fraction p/q in lowest terms. If $b_0 r$ is a fraction, say
with denominator s, then $b_1$ is a mixed number whose fractional term
has the denominator s. But s cannot divide p since p/q is in lowest
terms. Hence $b_1 r$ is certainly fractional. By continuing to the end, we
conclude that p/q cannot be a root if q does not divide $a_0$. If we apply
the argument to the reciprocal equation (3.01.8), we conclude that p
must divide $a_n$. Since there are only a finite number of possible choices
for p and q, these can be examined one by one.
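
The examination of the finitely many candidates p/q can be mechanized as follows. This Python sketch assumes integer coefficients with nonzero leading and constant terms, tries every signed quotient of a divisor of $a_n$ by a divisor of $a_0$, and evaluates the polynomial exactly by the nested scheme of this section. The function names are illustrative only, and the divisor search is the naive one.

```python
from fractions import Fraction


def divisors(m):
    """Positive divisors of a nonzero integer m."""
    m = abs(m)
    return [d for d in range(1, m + 1) if m % d == 0]


def rational_roots(coeffs):
    """Rational roots of an integer polynomial with a_0 != 0 and a_n != 0."""
    a0, an = coeffs[0], coeffs[-1]
    found = set()
    for p in divisors(an):
        for q in divisors(a0):
            for candidate in (Fraction(p, q), Fraction(-p, q)):
                value = Fraction(0)
                for a in coeffs:              # nested evaluation of P(candidate)
                    value = value * candidate + a
                if value == 0:
                    found.add(candidate)
    return sorted(found)
```

For $2x^3 - 3x^2 - 11x + 6 = (x - 3)(2x - 1)(x + 2)$ the routine returns -2, 1/2, and 3.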

In some of the numerical methods of evaluating roots of polynomial
equations, and for other purposes too, one often starts with a polynomial
P(x), replaces x by z + r, and wishes to evaluate the coefficients of the
polynomial P(z + r) as in §3.02. For example, r might be a close
approximation to a desired root of P(x) = 0, and we wish to replace the
equation by one in z for which the desired root z is as small as possible.
This is done in both Horner's method and in Newton's method, which will
be described later.

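As in deriving (3.02.2), we write

            $P(x) = c_0 (x - r)^n + c_1 (x - r)^{n-1} + \cdots + c_{n-1}(x - r) + c_n$,

where $c_n = P(r)$, so that $c_n$ is the remainder after dividing P(x) by x - r.
Again if we write

            $P(x) = (x - r)Q(x) + c_n$
                 $= (x - r)[c_0 (x - r)^{n-1} + \cdots + c_{n-1}] + c_n$,

it is clear further that $c_{n-1}$ is the remainder after dividing the quotient
by x - r, . . . . Hence we extend our synthetic division scheme as
follows:

            $\begin{array}{ccccccc|c}
            a_0 & a_1 & a_2 & \cdots & a_{n-2} & a_{n-1} & a_n & r \\
                & b_0 r & b_1 r & \cdots & b_{n-3} r & b_{n-2} r & b_{n-1} r & \\ \hline
            b_0 & b_1 & b_2 & \cdots & b_{n-2} & b_{n-1} & c_n & \\
                & b'_0 r & b'_1 r & \cdots & b'_{n-3} r & b'_{n-2} r & & \\ \hline
            b'_0 & b'_1 & b'_2 & \cdots & b'_{n-2} & c_{n-1} & & \\
            \cdots & \cdots & \cdots & \cdots & c_{n-2} & & &
            \end{array}$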



At each division we cut off the final remainder and repeat the synthetic
division with the preceding coefficients. This is sometimes
called reducing the roots of the equation, since every root $z_i$ of the
equation P(z + r) = 0 is r less than a corresponding root $x_i$ of P(x) = 0.
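
Repeated synthetic division can be carried out in place on a single list. In the Python sketch below each pass performs one division by x - r on the part of the list not yet cut off, so that on completion the list holds $c_0, c_1, \ldots, c_n$, the coefficients of P(z + r) in descending powers of z. The name `shift_polynomial` is an assumption of this illustration.

```python
def shift_polynomial(coeffs, r):
    """Coefficients c_0, ..., c_n of P(z + r), in descending powers of z."""
    work = list(coeffs)
    n = len(work) - 1
    for last in range(n, 0, -1):
        # one synthetic division; work[last] becomes the remainder c_last
        for k in range(1, last + 1):
            work[k] += work[k - 1] * r
    return work
```

Reducing the roots 1, 2, 3 of $x^3 - 6x^2 + 11x - 6$ by r = 1 gives $z^3 - 3z^2 + 2z$, whose roots are 0, 1, 2.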

In solving equations by Newton's or Horner's method, it is first
necessary to localize the roots roughly, and the first step in this is to
obtain upper and lower bounds for all the real roots. If in the process
of reducing the roots by a positive r the b's and the c of any line are all
positive, as well as all the c's previously calculated, then necessarily
all succeeding b's and c's will be positive. Hence the transformed equation
will have only positive coefficients and hence can have no positive
real roots. Hence the original equation can have no real roots exceeding
r. Hence any positive number r is an upper bound to the real roots of an
algebraic equation if in any line of the scheme for reducing the roots by r
all numbers are positive along with the c's already calculated.
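
This test, with its early termination, can be sketched as a variant of the reduction of roots. The Python function below assumes r > 0 and $a_0 > 0$, and returns True as soon as some line of the scheme, together with the remainders already cut off, is entirely positive. A return of False means only that the test is inconclusive, since the condition is sufficient but not necessary. The name `is_upper_bound` is ours.

```python
def is_upper_bound(coeffs, r):
    """Sufficient test that no real root of P exceeds the positive number r."""
    work = list(coeffs)
    n = len(work) - 1
    for last in range(n, 0, -1):
        for k in range(1, last + 1):
            work[k] += work[k - 1] * r
        # work[:last + 1] is the current line; work[last + 1:] are earlier c's
        if all(w > 0 for w in work):
            return True
    return False
```

For $x^3 - 6x^2 + 11x - 6$ the test succeeds with r = 4 after the second division, and fails (inconclusively) with r = 2, which indeed lies below the root 3.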

3.05. Sturm Functions; Isolation of Roots. The condition just given is 
sufficient for assuring us that r is an upper bound to the roots of P = 0, 
but it is not necessary. In particular if all coefficients of P are positive, 
the equation can have no positive roots. This again is a sufficient but 
not a necessary condition. A condition that is both necessary and 
sufficient will be derived in this section. In fact, we shall be able to 
tell exactly how many real roots lie in any interval. However, since it 
is somewhat laborious, some other weaker, but simpler, criteria will be 
given first. 

Suppose r is an m-fold root so that

            $P(x) \equiv (x - r)^m [P^{(m)}(r)/m! + \cdots]$.

Since $P^{(m)}(r) \neq 0$, there is some interval $(r - \epsilon, r + \epsilon)$ sufficiently
small so that $P^{(m)}(x)$ is non-null throughout the interval, and $P^{(m-1)}(x)$,
. . . , P'(x), P(x) are non-null except at r. Suppose $P^{(m)}(r) > 0$. Then
$P^{(m-1)}(x)$ is increasing throughout the interval, and so it must be negative
at $r - \epsilon$, positive at $r + \epsilon$. Hence $P^{(m-2)}(x)$ is decreasing, and hence
positive, at $r - \epsilon$; increasing, and hence again positive, at $r + \epsilon$. By
extending the argument, it appears that the signs at $r - \epsilon$ and at $r + \epsilon$
can be tabulated as follows:

                            . . .   $P^{(m-3)}$   $P^{(m-2)}$   $P^{(m-1)}$   $P^{(m)}$
            $r - \epsilon$    . . .      -            +            -           +
            $r + \epsilon$    . . .      +            +            +           +

If $P^{(m)}(r) < 0$, then every sign is reversed. If we count the variations
in sign in the two sequences, we find that at $r - \epsilon$ there are m variations,
for P and P' have opposite signs and present one variation, P' and P''
have opposite signs and present another, . . . . On the other hand, at
$r + \epsilon$ all signs are alike, and there are no variations. Hence the sequence
P, P', P'', . . . , $P^{(m)}$ loses m variations in sign as one passes over an
m-fold root of P = 0 in the direction of increasing x.

Next, suppose r is an m-fold root of $P^{(h)} = 0$ but not a root of $P^{(h-1)} = 0$
(and it may or may not be a root of P = 0). Then $P^{(h+m)}$ remains non-null
and of fixed sign, say > 0, throughout some interval $(r - \epsilon, r + \epsilon)$,
and from $P^{(h)}$ to $P^{(h+m)}$, m variations in sign are lost. However, we must
consider the possible variations $P^{(h-1)}$, $P^{(h)}$, for the sign of $P^{(h)}$ may
change, whereas that of $P^{(h-1)}$ does not. But if m is even, the sign of
$P^{(h)}$ does not change, so from $P^{(h-1)}$ to $P^{(h+m)}$ there is still a loss of just m
variations. If m is odd, $P^{(h)}$ does change, so that from $P^{(h-1)}$ to $P^{(h+m)}$
there is a loss of either m + 1 or of only m - 1 variations. In either
event it is a non-negative even number.

By considering every point r on an interval (a, b), at which P or any
of its derivatives may vanish, we conclude that, if $V_a$ and $V_b$ are the
numbers of variations in sign at a and b, respectively, displayed by the
sequence P, P', P'', . . . , then $V_a - V_b$ exceeds by a non-negative even
integer the number of roots of P = 0 on the interval.

This is Budan's theorem. In particular, $V_\infty = 0$, while

            $P^{(m)}(0) = m!\, a_{n-m}$.

Hence the number of variations in sign in the coefficients is $V_0$, and this
exceeds by a non-negative even integer the number of positive real roots.
This is Descartes's rule of signs. In counting the variations, vanishing
derivatives are ignored.

This is sometimes sufficient to give all the necessary information. 
Thus if there is a single variation or none, there will be one root or none; 
if an odd number, there is at least one. 
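
Both counts are simple to program. The Python sketch below counts variations in sign in a sequence, ignoring zeros, and forms $V_a - V_b$ for the sequence P, P', P'', . . . by differentiating the coefficient list repeatedly. Applied to the coefficients themselves, `sign_variations` gives the count in Descartes's rule. All names here are illustrative assumptions.

```python
def sign_variations(values):
    """Number of changes of sign in a sequence, zeros being ignored."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def derivative_values(coeffs, x):
    """Values P(x), P'(x), ..., P^(n)(x) for descending coefficients."""
    values = []
    current = list(coeffs)
    while current:
        v = 0
        for a in current:                 # nested evaluation
            v = v * x + a
        values.append(v)
        n = len(current) - 1
        current = [a * (n - i) for i, a in enumerate(current[:-1])]
    return values


def budan_bound(coeffs, a, b):
    """V_a - V_b, which exceeds the number of roots between a and b by an even integer >= 0."""
    return (sign_variations(derivative_values(coeffs, a))
            - sign_variations(derivative_values(coeffs, b)))
```

For $x^3 - 6x^2 + 11x - 6$ the coefficients show three variations and there are three positive roots; for $x^2 - 2x + 2$ they show two variations and there are none, the excess being the even number 2.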

Exact information, however, is always given by Sturm's theorem. In
the following sequence set

            $P_0 = P, \quad P_1 = P'$

for uniformity. Divide $P_0$ by $P_1$ and denote the remainder with its
sign changed by $P_2$. Divide $P_1$ by $P_2$ and denote that remainder with
its sign changed by $P_3$, . . . . The polynomials $P_0, P_1, P_2, \ldots$ are of
progressively lower degree, and the sequence must therefore terminate:

            $P_0 = Q_1 P_1 - P_2$,
            $P_1 = Q_2 P_2 - P_3$,
(3.05.1)    . . . . . . . . . .
            $P_{m-2} = Q_{m-1} P_{m-1} - P_m$,
            $P_{m-1} = Q_m P_m$.

(It is understood that any constant $\neq 0$ divides any polynomial exactly.)
Now $P_m$ is the highest common factor of $P_0$ and $P_1$. For by the last
equation, $P_m$ divides $P_{m-1}$; by the one before, since $P_m$ divides both itself
and $P_{m-1}$, it divides $P_{m-2}$; by the one before that, it divides also $P_{m-3}$,
. . . . Conversely, if p is any polynomial which divides both $P_0$ and $P_1$,
it therefore divides $P_2$ by the first equation; by the second, it divides
$P_3$, . . . , and by the next to last, it divides $P_m$. Thus the statement is
established.

It follows that, if P = 0 has any multiple roots, they are roots of
$P_m = 0$, and all roots of $P_m = 0$ are multiple roots of P = 0. Hence
we can find all multiple roots by solving an equation of degree lower than
the original, remove them from P, and continue with an equation of
degree < n whose roots are all simple.

We now suppose this to have been done in advance, that P = 0 has
only simple roots, and that therefore $P_m$ is a constant $\neq 0$. Then consider
the variations in sign presented by the sequence $P_i$. Suppose
$P_i(r) = 0$, where 0 < i < m. Then $P_{i+1}(r) \neq 0$. For if

            $P_i(r) = P_{i+1}(r) = 0$,

then $P_i$ and $P_{i+1}$ have a common divisor x - r. Since

(3.05.2)    $P_{i-1} = Q_i P_i - P_{i+1}$,

it follows that $P_{i-1}$ has also x - r as a divisor, and by continuing, we
conclude that also $P_1$ and $P_0$ have the divisor x - r. But then r is
a multiple root of P = 0, whereas there are no multiple roots. Hence if
$P_i(r) = 0$, then $P_{i+1}(r) \neq 0$ and also $P_{i-1}(r) \neq 0$.

Consider again (3.05.2). At x = r, $P_{i-1} = -P_{i+1}$. Hence $P_{i-1}$ and
$P_{i+1}$ have opposite signs at r and also throughout some small interval
$(r - \epsilon, r + \epsilon)$. Hence whatever the signs of $P_i$ at $r - \epsilon$ and $r + \epsilon$, these
three polynomials present the same number of variations at $r - \epsilon$ as at
$r + \epsilon$.

Now suppose $P_0(r) = 0$. Then $P_1(r) \neq 0$, and $P_1(x)$ keeps a fixed
sign in some interval $(r - \epsilon, r + \epsilon)$. If $P_1 > 0$ on this interval, then
$P_0(r - \epsilon) < 0 < P_0(r + \epsilon)$, while if $P_1 < 0$, $P_0(r - \epsilon) > 0 > P_0(r + \epsilon)$.
In either case $P_0$ and $P_1$ present one variation at $r - \epsilon$ and none at $r + \epsilon$.

Hence if $V_a$ and $V_b$ are now the numbers of variations in sign at a and b
of the sequence $P_0, P_1, P_2, \ldots, P_m$, then $V_a \geq V_b$ if a < b, and $V_a - V_b$
is exactly equal to the number of roots of P = 0 on the interval from
a to b.

We have proved the theorem only for the case that all roots are simple 
roots. By modifying the argument slightly, it can be shown that it is 
true also when there are multiple roots, provided each multiple root is 
counted only once and not with its multiplicity. This is in contrast 
to Budan's theorem where a root of multiplicity m was to be counted 
m times. 
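
Sturm's theorem can be tried out with exact rational arithmetic. The Python sketch below forms the sequence $P_0, P_1, \ldots, P_m$ by polynomial division with the sign of each remainder changed, and returns $V_a - V_b$. It assumes a polynomial of degree at least one and end points a < b that are not themselves roots; the function names are our own.

```python
from fractions import Fraction


def poly_divmod(num, den):
    """Quotient and remainder for descending coefficient lists."""
    num = [Fraction(c) for c in num]
    den = [Fraction(c) for c in den]
    quotient = []
    while len(num) >= len(den):
        factor = num[0] / den[0]
        quotient.append(factor)
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)                       # the leading term is now zero
    while num and num[0] == 0:           # strip leading zeros of the remainder
        num.pop(0)
    return quotient, num


def sturm_sequence(coeffs):
    """P_0 = P, P_1 = P', and then the negated remainders."""
    p0 = [Fraction(c) for c in coeffs]
    n = len(p0) - 1
    p1 = [c * (n - i) for i, c in enumerate(p0[:-1])]
    seq = [p0, p1]
    while True:
        _, rem = poly_divmod(seq[-2], seq[-1])
        if not rem:
            break
        seq.append([-c for c in rem])
    return seq


def evaluate(coeffs, x):
    v = Fraction(0)
    for a in coeffs:
        v = v * x + a
    return v


def variations(seq, x):
    signs = [v > 0 for v in (evaluate(p, x) for p in seq) if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def count_real_roots(coeffs, a, b):
    """Number of distinct real roots between a and b (V_a - V_b)."""
    seq = sturm_sequence(coeffs)
    return variations(seq, a) - variations(seq, b)
```

For $x^3 - 3x + 1$ the sequence is $x^3 - 3x + 1$, $3x^2 - 3$, $2x - 1$, 9/4, and the counts show three real roots between -2 and 2, of which exactly one lies between 0 and 1.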

In the practical application of Sturm's theorem, if for any i < m, $P_i$
can be recognized to have no real zeros, then it is unnecessary to continue.
Moreover, the sequence can be modified to

            $C_0 P_0 = Q_1 P_1 - P_2$,
            $C_1 P_1 = Q_2 P_2 - P_3$,
            . . . . . . . . . .

where $C_0, C_1, \ldots$ are positive constants. Thus if the coefficients of P
are integers, one can keep all coefficients of all the $P_i$ integers and,
moreover, remove any common numerical factors that may appear. A
convenient algorithm is the following: Let $b_0, b_1, b_2, \ldots$ represent the
coefficients of $P_1$ after removal of any common factor, and write the table

            $a_0 \quad a_1 \quad a_2 \quad a_3 \quad \cdots$
            $b_0 \quad b_1 \quad b_2 \quad b_3 \quad \cdots$
            $c'_0 \quad c'_1 \quad c'_2 \quad \cdots$

where

            $c'_p = \begin{vmatrix} a_0 & a_{p+1} \\ b_0 & b_{p+1} \end{vmatrix}, \quad p = 0, 1, \ldots$.

Obtain, next, the sequence

            $c_0 \quad c_1 \quad c_2 \quad \cdots$

from the sequences b and c' by

            $c_p = \begin{vmatrix} b_0 & b_{p+1} \\ c'_0 & c'_{p+1} \end{vmatrix}$.

Then these are the coefficients of $P_2$.
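
One reading of this integer-preserving scheme, stated here as an assumption because the printed determinants are not fully legible, is $c'_p = a_0 b_{p+1} - b_0 a_{p+1}$ followed by $c_p = b_0 c'_{p+1} - c'_0 b_{p+1}$, for p = 0, 1, . . . . A direct calculation shows that the resulting polynomial equals $(a_0 b_0 x - c'_0)P_1 - b_0^2 P_0$, so that it serves as $P_2$ with the positive constant $C_0 = b_0^2$. The Python sketch below implements that reading; the function name is hypothetical.

```python
def next_sturm_coefficients(a, b):
    """Integer coefficients of P_2 from those of P_0 (a) and P_1 (b).

    a has n + 1 entries, b has n entries; both in descending order.
    """
    n = len(a) - 1
    b = list(b) + [0] * (len(a) - len(b))    # pad so that b[p] exists for p <= n
    # c'_p = | a_0  a_{p+1} ; b_0  b_{p+1} |
    c_prime = [a[0] * b[p + 1] - b[0] * a[p + 1] for p in range(n)]
    # c_p = | b_0  b_{p+1} ; c'_0  c'_{p+1} |
    return [b[0] * c_prime[p + 1] - c_prime[0] * b[p + 1] for p in range(n - 1)]
```

With $P_0 = x^3 - 3x + 1$ and $P_1 = x^2 - 1$ (the derivative after removal of the factor 3) this yields 2, -1, that is, $P_2 = 2x - 1$, in agreement with ordinary division.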

3.06. The Highest Common Factor. It is now easy to prove a theorem
utilized in §2.06. Let $P_0$ and $P_1$ be any two polynomials with highest
common factor D. There exist polynomials $q_0$ and $q_1$ such that

(3.06.1)    $P_0 q_0 + P_1 q_1 \equiv D$.



Suppose the degree of $P_1$ is not greater than that of $P_0$. In deriving
relations (3.05.1), we were supposing that $P_1 = P_0'$, but this supposition
is not at all necessary for those relations. The assumption was used
only in the proof of Sturm's theorem. Hence for our arbitrary polynomials
we can form the successive quotients and remainders (this is
called the Euclidean algorithm) and obtain finally their highest common
factor $P_m \equiv D$. Write these in the form

            $P_2 = Q_1 P_1 - P_0$,
            $-Q_2 P_2 + P_3 = -P_1$,
            $P_2 - Q_3 P_3 + P_4 = 0$,
            . . . . . . . . . .
            $P_{m-2} - Q_{m-1} P_{m-1} + P_m = 0$,

and regard them as equations in the unknowns $P_2, P_3, \ldots, P_m$, the
coefficients Q being supposed known. The matrix of coefficients is unit
lower triangular and has determinant 1. Hence $P_m$ is itself expressible
as a determinant, in which $P_1$ and $P_0$ occur linearly, in the last column
only. Hence the expansion of the determinant has indeed the form of the
left member of (3.06.1), where $q_0$ and $q_1$ are polynomials, expressible in
terms of the Q's.
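
In practice the multipliers $q_0$ and $q_1$ are more conveniently accumulated during the divisions than read off from a determinant. The Python sketch below does this with the ordinary Euclidean algorithm (remainders taken without the change of sign used in (3.05.1), which alters D and the multipliers by constant factors at most) in exact rational arithmetic. The helper names are assumptions of this illustration, and the zero polynomial is represented by an empty list.

```python
from fractions import Fraction


def trim(p):
    """Drop leading zeros of a descending coefficient list."""
    i = 0
    while i < len(p) and p[i] == 0:
        i += 1
    return p[i:]


def add(p, q):
    n = max(len(p), len(q))
    p = [Fraction(0)] * (n - len(p)) + list(p)
    q = [Fraction(0)] * (n - len(q)) + list(q)
    return trim([x + y for x, y in zip(p, q)])


def mul(p, q):
    if not p or not q:
        return []
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(q):
            out[i + j] += x * y
    return trim(out)


def divmod_poly(num, den):
    num, quotient = list(num), []
    while len(num) >= len(den):
        factor = num[0] / den[0]
        quotient.append(factor)
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)
    return quotient, trim(num)


def extended_euclid(p0, p1):
    """Return (d, q0, q1) with p0*q0 + p1*q1 == d, d a highest common factor."""
    r0 = trim([Fraction(c) for c in p0])
    r1 = trim([Fraction(c) for c in p1])
    s0, s1 = [Fraction(1)], []           # multipliers of p0
    t0, t1 = [], [Fraction(1)]           # multipliers of p1
    while r1:
        q, r = divmod_poly(r0, r1)
        neg_q = [-c for c in q]
        r0, r1 = r1, r
        s0, s1 = s1, add(s0, mul(neg_q, s1))
        t0, t1 = t1, add(t0, mul(neg_q, t1))
    return r0, s0, t0
```

For $P_0 = (x - 1)(x - 2)$ and $P_1 = (x - 1)(x - 3)$ the routine returns D = x - 1 with $q_0 = 1$ and $q_1 = -1$, and indeed $P_0 - P_1 = x - 1$.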

3.07. Power Series and Analytic Functions. A few basic theorems on 
series, and in particular on power series, will be stated here for future 
reference. Proofs, when not given, can be found in most calculus texts. 
Consider first a series of any type

(3.07.1)    $b_0 + b_1 + b_2 + \cdots$,

where the $b_i$ are real or complex numbers. Let

(3.07.2)    $s_n = b_0 + b_1 + \cdots + b_n$

represent the sum of the first n + 1 terms. The series (3.07.1) converges
to the limit s, provided $\lim_{n \to \infty} s_n = s$, that is, provided for any positive $\epsilon$
there exists an N such that $|s_n - s| < \epsilon$ whenever n > N. A theorem
of Cauchy states that the series (3.07.1) converges if and only if

            $\lim_{n \to \infty} |s_{n+p} - s_n| = 0$

uniformly for every positive integer p, that is, if and only if for any
positive $\epsilon$ there is an N, independent of p, such that $|s_{n+p} - s_n| < \epsilon$
whenever n > N. In particular $s_{n+1} - s_n = b_{n+1}$, so that the
theorem implies, with p = 1, that the individual terms in the series
approach zero in absolute value.

If a new series is formed by dropping any finite number of terms from 
the original, or by introducing any finite number of terms into the original, 
the two series converge or diverge together. 

The series is said to converge absolutely in case the series

            $|b_0| + |b_1| + |b_2| + \cdots$

of moduli converges. If the series converges absolutely, then it converges
since

            $|b_{n+1} + \cdots + b_{n+p}| \leq |b_{n+1}| + \cdots + |b_{n+p}|$.

If a series is absolutely convergent, its terms can be rearranged or associated
in any fashion without affecting the fact of convergence or the limit
to which it converges. This is not true of series which do not converge
absolutely. Also if for every n it is true that $|b'_n| \leq |b_n|$, then the series

            $b'_0 + b'_1 + b'_2 + \cdots$

converges, and indeed converges absolutely, if (3.07.1) converges
absolutely. We shall say then that the series (3.07.1) dominates
the other.
Since

            $1 + x + x^2 + \cdots + x^n = (1 - x^{n+1})/(1 - x)$

identically, it follows that the geometric series, obtained by setting
$b_i = \gamma x^i$, converges absolutely for any $\gamma$ and for any x satisfying |x| < 1.
In fact, for this series

            $\gamma/(1 - x) - s_n = \gamma x^{n+1}/(1 - x)$,

and when |x| < 1, this has the limit zero.

If for a real positive $\beta < 1$ it is true that $\lim_{n \to \infty} |b_{n+1}|/|b_n| = \beta$, then
(3.07.1) converges absolutely. For select any positive $\epsilon < 1 - \beta$.
Then there is an N such that for n > N

            $|b_{n+1}|/|b_n| < \beta + \epsilon$.

Hence for any p

            $|b_{n+p}|/|b_n| < (\beta + \epsilon)^p$.

Hence the series

            $|b_n| + |b_{n+1}| + \cdots = |b_n| (1 + |b_{n+1}|/|b_n| + |b_{n+2}|/|b_n| + \cdots)$

converges since it is dominated by the terms of the geometric series.
Hence (3.07.1) converges absolutely.

On the other hand, if for a real positive $\beta > 1$ it is true that

            $\lim_{n \to \infty} |b_{n+1}|/|b_n| = \beta$,

then the series diverges.

If for some positive $\beta < 1$ it is true that $\lim_{n \to \infty} |b_n|^{1/n} = \beta$, then the series
converges absolutely, but if $\beta > 1$, it diverges.

Any convergent series has a term of maximum modulus. For since
the sequence of terms $b_n$ has the limit zero, for any $\epsilon$ there is a term $b_N$
such that all subsequent terms are less than $\epsilon$ in modulus. Choose $\epsilon$ less
than the modulus of some term in the series. Among the N + 1 terms
$b_0, \ldots, b_N$, there is one whose modulus is not exceeded by that of any
other of these terms, nor is it exceeded by the modulus of any $b_n$ for
n > N. Hence this is a term of maximum modulus.

When the terms $b_i$ of the series (3.07.1) are functions of x, the limit,
when it exists, is also a function of x, and we may write

(3.07.3)    $f(x) = b_0(x) + b_1(x) + b_2(x) + \cdots$.

The series converges at x to the limit f(x) in case for any $\epsilon$ there exists an
N such that $|f(x) - s_n(x)| < \epsilon$ whenever n > N. Clearly N depends
upon $\epsilon$, and the smaller one requires $\epsilon$ to be, the larger N must be made.
In general N will depend also on x. But if for every x in some region the
series converges, and if, moreover, for every $\epsilon$ there is an N independent
of x in that region, then the series is said to be uniformly convergent in
the region.

If every $b_i(x)$ is continuous, and the series is uniformly convergent in
some region, then $f(x)$ is continuous in that region. To show that $f(x)$ is
continuous at $x_0$ in that region, one must show that for any positive
$\epsilon$ there is a $\delta$ such that $|f(x) - f(x_0)| < \epsilon$ whenever
$|x - x_0| < \delta$. Any finite sum $s_n(x)$ is continuous, whence there exists
a $\delta$ such that $|s_n(x) - s_n(x_0)| < \epsilon/3$ whenever $|x - x_0| < \delta$.
Let $N$ be chosen so that $|f(x) - s_n(x)| < \epsilon/3$ for all $x$ in the region
whenever $n > N$. Hence

    $|f(x) - f(x_0)| \le |s_n(x) - s_n(x_0)| + |f(x) - s_n(x)| + |f(x_0) - s_n(x_0)| < \epsilon$.

In case

(3.07.4)    $f(x) = a_0 + a_1 x + a_2 x^2 + \cdots$,

and the series converges for any $x_0$, then the series converges absolutely
and uniformly for all $x$ satisfying $|x| < |x_0|$. For the series $f(x_0)$ has a
maximal term. Let its modulus be $\gamma$. Hence for every $i$

(3.07.5)    $|a_i x_0^i| \le \gamma$.

But if $|x| < |x_0|$, then since

    $|a_i x^i| = |a_i x_0^i| \, |x/x_0|^i \le \gamma |x/x_0|^i$,

the geometric series $\gamma \Sigma |x/x_0|^i$ converges and dominates the series
for $f(x)$. Also any $N$ that is effective for the series (3.07.4) when
$x = x_0$ is a fortiori effective when $|x| < |x_0|$, whence the series is
uniformly convergent for $|x| < |x_0|$. Since every term is a continuous
function of $x$, therefore $f(x)$ is a continuous function of $x$.

If the series (3.07.4) diverges for any $x = x_0$, then it diverges for all $x$
of greater moduli. For if $|x_1| > |x_0|$, and the series converged at $x_1$,
then by the theorem just proved the series would converge at $x_0$, contrary
to hypothesis. Hence if the series converges for any $x \neq 0$, it either
converges for all $x$, or else there is some circle about the origin such
that the series converges throughout the interior and diverges at every
point outside the circle. The behavior at points on the circle can only be
determined by further study. This circle is the circle of convergence of
the power series.

From (3.07.5) it follows that, if the series (3.07.4) converges for $x = x_0$,
then for every $i$

(3.07.6)    $|a_i| \le \gamma/|x_0^i|$,

where $\gamma$ is the modulus of the term of maximum modulus.

We have seen that $f(x)$ defined by (3.07.4) is continuous throughout its
circle of convergence. It is also differentiable throughout the same
circle, and

(3.07.7)    $f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + \cdots$.

To prove this, let $r = |x| < R$, where $R$ is the radius of the circle of
convergence; let $\alpha_i = |a_i|$; and let

    $F(r) = \alpha_0 + \alpha_1 r + \alpha_2 r^2 + \cdots$.

If $0 < \delta < R - r$, then the series $F(r)$ and also the series $F(r + \delta)$
both converge. Hence the series

    $\delta^{-1}[F(r + \delta) - F(r)] = \alpha_1 + \alpha_2(2r + \delta) + \alpha_3(3r^2 + 3r\delta + \delta^2) + \cdots$

converges. If $|h| < \delta$,

    $h^{-1}[f(x + h) - f(x)] = a_1 + a_2(2x + h) + a_3(3x^2 + 3xh + h^2) + \cdots$,

and this last series is dominated by the previous one. Hence as a series
in functions of $h$ for a fixed $x$ the latter series converges uniformly and
hence defines a continuous function of $h$. But for $h = 0$ we have the
series (3.07.7).

Thus a function defined by a power series (3.07.4) is continuous and
differentiable throughout its circle of convergence, and the series (3.07.7)
for its derivative converges in the same circle. But the same can therefore
be said of $f'$, so that $f''$ exists, as does $f'''$, . . . , and for each the
series converges in the same circle. The function $f$ is said to be analytic
in the circle within which its power series converges. Since

    $f^{(n)}(0) = n! \, a_n$,

the series can be written in the form

(3.07.8)    $f(x) = f(0) + x f'(0) + x^2 f''(0)/2! + \cdots$,

and this is its Maclaurin expansion. By a change of origin one can also
write the more general Taylor expansion

(3.07.9)    $f(x) = f(r) + (x - r) f'(r) + (x - r)^2 f''(r)/2! + \cdots$

already given for the polynomials.
The series

(3.07.10)    $F(x) = a_0 x + a_1 x^2/2 + a_2 x^3/3 + \cdots$

formed by integrating each term from $0$ to $x$ has a radius of convergence
at least as large as that of $f$. For if $x \neq 0$, the series $x^{-1}F(x)$ is
dominated by $f(x)$. However, since $f = F'$, the radius of convergence of $F$
can be no greater than that of $f$, as we have seen.
If $f(x)$ is an analytic function, the equation

(3.07.11)    $f(x) = 0$

will be said to have a root $r$ of multiplicity $m$ in case $(x - r)^{-m} f(x)$
is analytic at $r$ but $(x - r)^{-m-1} f(x)$ is not. But from the expansion
(3.07.9) it appears that this will be so if and only if

    $f(r) = f'(r) = \cdots = f^{(m-1)}(r) = 0$,    $f^{(m)}(r) \neq 0$.

Hence, in that case

    $f(x) = (x - r)^m [f^{(m)}(r)/m! + (x - r) f^{(m+1)}(r)/(m+1)! + \cdots]$.

In particular, the equation

    $f(x)/f'(x) = 0$

will have only simple roots, if any.

Budan's theorem holds in the case of an analytic function provided,
for some $k$, $f^{(k)}(x)$ keeps the same sign throughout the interval from
$a$ to $b$.

3.08. König's Theorem. Consider any function

(3.08.1)    $f(z) = a_0 + a_1 z + a_2 z^2 + \cdots$

for which the expansion converges in some circle about the origin. Suppose
that within this circle $f(z)$ has one and only one zero $\alpha$, which is
simple. Let $g(z)$ be analytic throughout the circle and $g(\alpha) \neq 0$.
Then the expansion

(3.08.2)    $g/f = h(z) = h_0 + h_1 z + h_2 z^2 + \cdots$

converges for all $|z| < |\alpha|$, while the expansion

(3.08.3)    $(\alpha - z)h(z) = F(z) = k_0 + k_1 z + k_2 z^2 + \cdots + k_\nu z^\nu + \cdots$

converges throughout the circle. Then for $|z| < |\alpha|$

(3.08.4)    $(\alpha - z)(h_0 + h_1 z + \cdots) = k_0 + k_1 z + \cdots$,

so that

            $\alpha h_0 = k_0$,
(3.08.5)    $-h_0 + \alpha h_1 = k_1$,
            . . . . . . . . . .
            $-h_{\nu-1} + \alpha h_\nu = k_\nu$.

On multiplying these equations by $1, \alpha, \alpha^2, \ldots$ and adding, one
obtains

    $\alpha^{\nu+1} h_\nu = k_0 + k_1 \alpha + \cdots + k_\nu \alpha^\nu$.

Let

(3.08.6)    $F_\nu(z) \equiv k_0 + k_1 z + \cdots + k_\nu z^\nu = F(z) - R_{\nu+1}(z)$.

Then

(3.08.7)    $h_\nu = \alpha^{-\nu-1} F_\nu(\alpha)$

and

(3.08.8)    $h_\nu/h_{\nu+1} = \alpha F_\nu(\alpha)/F_{\nu+1}(\alpha)$.

However, $F$ is analytic at $\alpha$, and the series (3.08.3) converges for
$z = \alpha$. Let $\rho'$ be the radius of convergence of this series and let
$\rho$ satisfy $|\alpha| < \rho < \rho'$. Then the series converges for $z = \rho$.
If $\gamma$ is the modulus of the term of maximum modulus of the series
$F(\rho)$, then

    $|k_\nu| \le \gamma/\rho^\nu$.

By (3.08.8)

    $\alpha - h_\nu/h_{\nu+1} = k_{\nu+1} \alpha^{\nu+2}/F_{\nu+1}(\alpha)$.

Hence

(3.08.9)    $|\alpha - h_\nu/h_{\nu+1}| < \mu |\alpha^{\nu+1}/\rho^{\nu+1}|$,

where $\mu$ is some positive quantity depending upon the value of $\gamma$,
$\alpha$, and $F(\alpha)$. Since $|\alpha/\rho| < 1$, this proves that

    $\lim_{\nu \to \infty} h_\nu/h_{\nu+1} = \alpha$

and shows, moreover, that the convergence is geometric with a ratio
$|\alpha/\rho|$.
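As a numerical illustration of König's theorem, the following sketch forms the coefficients $h_\nu$ of $g/f$ by comparing coefficients in $f \cdot h = g$ and prints the last ratio $h_\nu/h_{\nu+1}$. The choices $f(z) = e^z - 2$ (whose zero of smallest modulus is $\log 2$) and $g = 1$, the truncation of the series at 30 terms, and the function name `koenig_ratios` are all assumptions made for illustration only.

```python
from math import factorial, log

def koenig_ratios(a, g, count):
    """Return h_nu / h_{nu+1} for nu = 0..count-1, where h = g/f as power series.

    a, g hold the leading power-series coefficients of f and g (a[0] != 0).
    """
    h = []
    for nu in range(count + 1):
        g_nu = g[nu] if nu < len(g) else 0.0
        # a_0 h_nu + a_1 h_{nu-1} + ... + a_nu h_0 = g_nu
        acc = sum(a[j] * h[nu - j] for j in range(1, min(nu, len(a) - 1) + 1))
        h.append((g_nu - acc) / a[0])
    return [h[nu] / h[nu + 1] for nu in range(count)]

# Assumed test function: f(z) = exp(z) - 2, so a_0 = -1 and a_k = 1/k! for k >= 1.
a = [-1.0] + [1.0 / factorial(k) for k in range(1, 30)]
ratios = koenig_ratios(a, [1.0], 20)
print(ratios[-1], log(2.0))
```

The remaining zeros of this $f$ lie at $\log 2 \pm 2\pi i k$, well outside $|z| = \log 2$, so the geometric ratio of convergence is small and a few terms suffice.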



König's theorem has an important extension to the case in which $f(z)$
has exactly $n$ simple zeros within some circle about the origin. The
extended theorem and its proof are sufficiently well illustrated by the case
$n = 2$, and this will now be given. Let the zeros be $\alpha_1$ and $\alpha_2$, and
suppose that within some small circle about the origin $f(z)$ has these and
no other zeros. Take $g$ and $h$ as before but with $g(\alpha_1) \neq 0$ and
$g(\alpha_2) \neq 0$.
Let



(3.08.10) P(z) 6 (1 - 
and now take 

(3.08.11) P(z)h(z) = F(z) - 

Then $F(z)$ is analytic throughout the circle. We set

    $(b_0 + b_1 z + b_2 z^2)(h_0 + h_1 z + \cdots) = k_0 + k_1 z + \cdots$,

so that

             $b_0 h_0 = k_0$,
             $b_1 h_0 + b_0 h_1 = k_1$,
             $b_2 h_0 + b_1 h_1 + b_0 h_2 = k_2$,
(3.08.12)    $b_2 h_1 + b_1 h_2 + b_0 h_3 = k_3$,
             . . . . . . . . . .

If we multiply these equations by $1, \alpha_i, \alpha_i^2, \ldots, \alpha_i^\nu$, and add, we obtain



where F v (z} is defined as in (3.08.6). Likewise 



and 



These equations hold for $i = 1$ or $i = 2$. Multiply the first of these
equations through by $\alpha_i^2$, and the second by $\alpha_i$. The equations then show
that the determinant 



v i h t 



= 0, 



since they express the linear dependence of the three columns. 

Now if p' is the radius of convergence of (3.08.1 1), p satisfies |a| < p < 
p' for i = 1 and 2, and y is the modulus of the term of maximum modulus 
of the series F(p), then 



4- k-/p| 



> |i|, then 



If 



and, a fortiori, 



Let a = a? in case 



Then if the equation 
(3.08.13) 



.) I < Tk 



, otherwise a = i, and set 



At = | 2 /p|. 

h v hi 



h 



= 



is expanded to the form 



= 0, 



where the coefficients represent the cofactors of the powers of z in the 
determinant (3.08.13), then each of these coefficients differs by a factor 
of less than MH V+I from the coefficients of a quadratic 



Co2 
satisfied by $\alpha_1$ and $\alpha_2$. Hence in the limit for large $\nu$ the quadratic
(3.08.13) is satisfied by the two roots $\alpha_1$ and $\alpha_2$ of smallest modulus of
$f = 0$, if such exist within the circle of convergence of the expansion of $f$.
In the general case of $n$ roots the equations of the sequence are

(3.08.14)
    $\begin{vmatrix}
      h_\nu     & h_{\nu-1}   & \cdots & h_{\nu-n+1} & z^n     \\
      h_{\nu+1} & h_\nu       & \cdots & h_{\nu-n+2} & z^{n-1} \\
      \cdots    & \cdots      & \cdots & \cdots      & \cdots  \\
      h_{\nu+n} & h_{\nu+n-1} & \cdots & h_{\nu+1}   & 1
    \end{vmatrix} = 0$.



3.1. The Graeffe Process. We turn now to methods for solving a single
equation in a single unknown. We have seen that one can express the
sum $s_p$ of the $p$th powers of the roots of an algebraic equation as a
rational function of the coefficients of the equation by relations (3.02.5).
But we can write

    $s_p = x_1^p [1 + (x_2/x_1)^p + \cdots + (x_n/x_1)^p]$.

Hence if there should be one root that is larger than all the others, say
$x_1$, then for a sufficiently large $p$ all fractions within the parentheses
should become negligible, and we would have approximately

    $s_p \approx x_1^p$,

and in particular

    $\lim_{p \to \infty} s_p^{1/p} = x_1$.

Hence if a feasible method could be found for computing $s_p$ for sufficiently
large $p$, we could take the $p$th root of this and obtain thereby an
evaluation of the largest root (in case there is such) of the equation.

The Graeffe process does this, and somewhat more. If we write the
equation in the form

(3.1.1)    $a_0 x^n + a_2 x^{n-2} + \cdots = -(a_1 x^{n-1} + a_3 x^{n-3} + \cdots)$

and square both sides, we obtain

    $a_0^2 x^{2n} + 2a_0 a_2 x^{2n-2} + (2a_0 a_4 + a_2^2) x^{2n-4} + \cdots = a_1^2 x^{2n-2} + 2a_1 a_3 x^{2n-4} + \cdots$,

or

    $a_0^2 x^{2n} + (2a_0 a_2 - a_1^2) x^{2n-2} + (2a_0 a_4 - 2a_1 a_3 + a_2^2) x^{2n-4} + \cdots = 0$.

Since only even powers of $x$ occur here, this can be written

(3.1.2)    $a_0^2 y^n + (2a_0 a_2 - a_1^2) y^{n-1} + (2a_0 a_4 - 2a_1 a_3 + a_2^2) y^{n-2} + \cdots = 0$,

where

    $y = x^2$.

Hence we obtain a new equation whose roots are the squares of the roots
of the original equation. If we repeat, we obtain an equation whose roots
are the fourth powers, another repetition gives one with the eighth
powers, etc. After $p$ such operations we obtain an equation whose roots
are the $2^p$th powers of the roots of the original:

(3.1.3)    $a_0^{(p)} x^n + a_1^{(p)} x^{n-1} + \cdots = 0$.

At any stage if we write the coefficients in sequence

    $a_0^{(p)}, a_1^{(p)}, a_2^{(p)}, a_3^{(p)}, \ldots$,

then to get the new coefficient $a_i^{(p+1)}$ we take the product of $a_0^{(p)}$
by the coefficient symmetrically placed with respect to $a_i^{(p)}$ and double,
subtract the double product of $a_1^{(p)}$ by its symmetric mate, . . . , ending
with $\pm (a_i^{(p)})^2$. Now if the roots are $x_i$, then

(3.1.4)    $a_1/a_0 = -\Sigma x_i$,  $a_1^{(1)}/a_0^{(1)} = -\Sigma x_i^2$,  . . . ,  $a_1^{(p)}/a_0^{(p)} = -\Sigma x_i^{2^p}$.

If the roots are all distinct, and $x_1$ has a modulus larger than that of any
other root, then eventually

(3.1.5)    $-a_1^{(p)}/a_0^{(p)} = x_1^{2^p}$.

Now it is also true that

    $a_2^{(p)}/a_0^{(p)} = \Sigma_{i<j} x_i^{2^p} x_j^{2^p}$,

so that by a similar argument we can take eventually

    $a_2^{(p)}/a_0^{(p)} = x_1^{2^p} x_2^{2^p}$,

(3.1.6)    $a_2^{(p)}/a_1^{(p)} = -x_2^{2^p}$,

if the modulus of $x_2$ exceeds that of every other root except for $x_1$. Again

(3.1.7)    $a_3^{(p)}/a_2^{(p)} = -x_3^{2^p}$,  . . . .

If the equation has only simple real roots, all relations (3.1.7) are valid
for sufficiently large $p$. The signs of the roots are undetermined, but
these can be obtained by substitution or in other ways. But if $P(x)$ is
real, and the equation has complex roots, these occur in complex conjugate
pairs with equal moduli, and there may be any number of unequal roots,
all having the same modulus. For example, all $n$ roots of

    $x^n - 1 = 0$

have unit modulus, and the method fails.
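The root-squaring rule and the quotients of successive coefficients can be tried numerically. The sketch below is only an illustration under stated assumptions: the function names `graeffe_step` and `graeffe_moduli` are hypothetical, the coefficients are listed as $a_0, a_1, \ldots, a_n$ for $a_0 x^n + a_1 x^{n-1} + \cdots + a_n$, and the test cubic $(x - 1)(x - 2)(x - 3)$ is chosen so that all roots are real and of distinct moduli. Integer coefficients are used so that the squarings are exact.

```python
def graeffe_step(a):
    """One root squaring: coefficients of the equation whose roots are squared."""
    n = len(a) - 1
    new = []
    for i in range(n + 1):
        total = a[i] * a[i]
        j = 1
        # add the doubled products of symmetric mates with alternating signs
        while i - j >= 0 and i + j <= n:
            total += 2 * (-1) ** j * a[i - j] * a[i + j]
            j += 1
        new.append((-1) ** i * total)
    return new

def graeffe_moduli(a, steps):
    """Estimate the root moduli from |a_i / a_{i-1}| after `steps` squarings."""
    for _ in range(steps):
        a = graeffe_step(a)
    m = 2 ** steps
    return [abs(a[i] / a[i - 1]) ** (1.0 / m) for i in range(1, len(a))]

coeffs = [1, -6, 11, -6]            # (x - 1)(x - 2)(x - 3), an assumed example
print(graeffe_step(coeffs))         # roots 1, 4, 9
print(graeffe_moduli(coeffs, 5))    # moduli close to 3, 2, 1
```

Only the moduli are recovered in this way; the signs must still be settled by substitution, as remarked above.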
If there is one pair of complex roots whose moduli exceed the moduli of
all others,

    $|x_1| = |x_2| = \rho > |x_i|$    $(i > 2)$,

then in polar form

    $x_1 = \rho \exp i\theta$,
    $x_2 = \rho \exp (-i\theta)$,

so that for $m = 2^p$

    $x_1^m = \rho^m \exp mi\theta$,
    $x_2^m = \rho^m \exp (-mi\theta)$,

and

    $x_1^m + x_2^m = 2\rho^m \cos m\theta$.

Hence for larger values of $p$, $\cos m\theta$ will fluctuate in value and even in
sign, causing $a_1^{(p)}/a_0^{(p)}$ to do likewise. However, in $a_2^{(p)}/a_0^{(p)}$ the
dominant term will be

    $x_1^m x_2^m = \rho^{2m}$.

If we can be sure that we stop where $\cos m\theta$ is not too small, then
$x_1^m + x_2^m$ will dominate the other terms in $a_1^{(p)}/a_0^{(p)}$, and we can
obtain both $\rho$ and $\theta$, but with the quadrant of $\theta$ undecided.

This indeterminacy can be resolved if we apply the root-squaring
process to the equation

    $P(y + h) = 0$

as well as to the original equation, where $h$ is a small fixed number (using
the method of §3.04 to obtain the coefficients of $y$). Each root $y_i$ of this
equation is related to a root of the original equation by

    $y_i = x_i - h$,

and if $h$ is small enough, the moduli of $y_1$ and $y_2$ will also exceed the
moduli of all the other roots. If

    $\sigma = |y_1| = |y_2|$,

then our roots $x_1$ and $x_2$ lie in the complex plane where the circle of radius
$\rho$ about the origin intersects the circle of radius $\sigma$ about the point $h$
units to the right of the origin (or $|h|$ units to the left if $h$ is negative).
This determines $\theta$ uniquely. In case there are other roots $x_i$ with the same
modulus, the corresponding roots $y_i$ will have different moduli, and this
difficulty is thereby removed.

3.11. Lehmer's Algorithm. The technique of investigating the roots
$y_i$ of $P(y + h) = 0$, along with the roots $x_i$ of $P(x) = 0$, has the
disadvantage of requiring two applications of the Graeffe process in addition
to the special computations involved in the determination of the
coefficients of $y$. Moreover in selecting $h$, one should be careful to make it
small enough so that, if roots $x_i$ and $x_j$ are such that $|x_i| > |x_j|$, then
also $|x_i - h| > |x_j - h|$.

Brodetsky and Smeal therefore make the natural proposal that $h$ should
be "infinitesimal," and Lehmer has developed an effective algorithm.

The original Graeffe process can be described in slightly different terms
by saying that we start with a polynomial $P(x)$ and obtain from it a
polynomial $P_1(x)$ whose zeros are the squares of those of $P$; from $P_1$ we
obtain $P_2(x)$ whose zeros are the squares of those of $P_1$ and hence the
fourth powers of those of $P$, . . . . On setting $P_0 = P$ for uniformity,
one verifies that

(3.11.1)    $P_{p+1}(x) = P_p(\sqrt{x}) P_p(-\sqrt{x})$,    $p = 0, 1, 2, \ldots$.

In fact

    $P_1(x) = a_0^2 \Pi (\sqrt{x} - x_i)(-\sqrt{x} - x_i)$
            $= a_0^2 \Pi (-x + x_i^2)$,

and the general statement follows by a simple induction.
If we write

    $Q_0(x) = a_0 \Pi (x - x_i - h)$,

    $Q_1(x) = a_0^2 \Pi (\sqrt{x} - x_i - h)(-\sqrt{x} - x_i - h) = Q_0(\sqrt{x}) Q_0(-\sqrt{x})$,

and continue the same procedure with

    $Q_{p+1}(x) = Q_p(\sqrt{x}) Q_p(-\sqrt{x})$,

then we find inductively that $Q_p(x)$ is a polynomial whose zeros are the $n$
quantities

    $(x_i + h)^m = x_i^m + mh x_i^{m-1} + \cdots$,    $m = 2^p$,

where the terms omitted contain $h^2$ and higher powers of $h$.
Lehmer's algorithm is obtained by setting 

(3.11.2) (z, h) ** (x - ft)-"Po(* - h) 

and defining recursively 

(3.11.3) <p-f.i(rc, h) = <t> P (V x > ft)^p(~ V*> ft)- 
Then 

(3.11.4) Q,(a?) = (ft m - )"*,(*, ft). 

Also 

</>o(a;, ft) = <#>o(a;, 0) h<j>' Q (x, 0) + 



- a 
where

    $\phi' = \partial\phi/\partial x = -\partial\phi/\partial h$,

and

    $b_r = (r - 1) a_{r-1}$,    $r = 2, 3, \ldots$.

By direct calculation from (3.11.3) if 

(3.11.5) 4>^i(x, h) = o^*-" + o^-ar + a< 

-f 



we obtain the recursion 

r-l 

<*+ = (-l)'a}" 1 + 2 (-IVaWagL,, r - 0, 1, . . . , 
(3.H.6) ar _ 1 '- 

L,, r - 1, 2, ---- 



If ao = 1, as we may suppose, then a p) = 1 for every p. From 
(3.11.4), Q p and ( x) n <t> p differ only by terms containing the factor W. 
Hence the coefficient of a;" 1 in < p is the sum of the zeros. But the zeros 
are of the form 

(xi + K) m = xf 
Hence 



But if there is a root of largest modulus x\ t then for p sufficiently large it 
follows that 

(3.11.8) xi = <*f/Vp, 

approximately. Thus we obtain the solution directly without the need 
for a root extraction. Again if 



then 



whence 

(3.11.10) a; 2 * 
In like manner if 

M > 

then 

(3.11.11) x, * 
There are many special cases that can occur, and effort will be made,
not to enumerate them all, but only to suggest how they can be treated.
If $x_1$ is a root of multiplicity $k$ and

    $|x_1| > |x_{k+1}| > \cdots$,

then 



so that (3.11.8) still holds. Moreover, 




- 21 ^ Ir 2 " 1 - 1 
Z \2/ Xl ' 



_ 1 \k+\ n (p) JL. /y.fcm 

-U +! x l 

/ _ T\Ar-fl/i(p) -- / r A;w/ r m 1 I 

v J^^ ^-hi x i x k+i i 
Hence 



Analogous relations can be worked out for a multiple root of intermediate 
modulus. 

The most important case of unequal roots of equal modulus occurs for 
a real polynomial with complex roots. Suppose, for example, 



Then if $P(x)$ is real, and $x_1 \neq x_2$, we can write

    $x_1 = \rho \exp (i\theta)$,    $x_2 = \rho \exp (-i\theta)$,
    $x_1^m = \rho^m \exp (mi\theta)$,    $x_2^m = \rho^m \exp (-mi\theta)$.

Since

    $\exp (\pm i\theta) = \cos \theta \pm i \sin \theta$,



( 1 * )) will contain the term 2p m cos m6 which will oscillate in value with 
increasing p and m but will dominate the other terms whenever m6 is not 
too far from an integral multiple of ir. When this is the case, we may say 

a (p) -2 p m cos me, 
b ( f = -2p m ~ l cos (m - 1)6. 
Also 
Since p is real and positive, this is obtainable from a ( } by a root extraction, 
and then $ can be found from bp } or, in fact, from b{ p) or a ( f. In any 
event there is no ambiguity since the two possible quadrants for corre- 
spond to the two roots x\ and a? 2 . 

3.12. Transcendental Equations. If one sets $z = x^{-1}$, an equation
equivalent to $P = 0$ is the equation

(3.12.1)    $f(z) \equiv a_0 + a_1 z + a_2 z^2 + \cdots = 0$

in $z$. Each formula referring to the application of the Graeffe process in
either the original form or that given by Lehmer remains valid if each
$x$ root $x_i$ is replaced by $z_i^{-1}$, the reciprocal of a $z$ root of (3.12.1).
But when this is done, they are applicable also to the case when $f$ is
analytic and not necessarily a polynomial. That is to say, if among the
roots of (3.12.1) which lie within the circle of convergence there is a root
$z_1$ whose modulus is less than that of all the others, then

(3.12.2) lim ap } /a?' = -2?', 



which corresponds to (3.1.5), and 

(3.12.3) lim Vf/a { f = 21, 



J- 00 



which corresponds to (3.11.8). Polya's demonstration of the formulas 
such as (3.12.2) will now be sketched briefly. 

Suppose that the series $f(z)$ converges inside a circle of radius $\rho'$ at
least and that the equation has exactly $n$ roots, $z_1, z_2, \ldots, z_n$, of
moduli less than $\rho'$. Choose $\rho$ and designate the roots so that

    $|z_1| \le |z_2| \le \cdots \le |z_n| < \rho < \rho'$.

It is no restriction to suppose that $a_0 = 1$. For any positive integer $m$,
let

    $\omega = \exp (2\pi i/m) = \cos (2\pi/m) + i \sin (2\pi/m)$.

Thus $\omega$ is a complex $m$th root of unity, and one can verify that the other
complex roots are $\omega^2, \ldots, \omega^{m-1}$ and that

    $1 + \omega + \omega^2 + \cdots + \omega^{m-1} = 0$.

This done, one can verify that the product

(3.12.4)    $f(z) f(\omega z) \cdots f(\omega^{m-1} z) \equiv 1 + a_{1,m} z^m + a_{2,m} z^{2m} + \cdots$

contains only powers of $z^m$. The Graeffe process takes advantage of this
in the special case when $m$ is some power of 2. The theorem states that

(3.12.5)    $\lim_{m \to \infty} z_1^m z_2^m \cdots z_n^m a_{n,m} = (-1)^n$.
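If we write

            $f = (z - z_1)(z - z_2) \cdots (z - z_n)\phi$,

(3.12.6)    $f'/f = (z - z_1)^{-1} + (z - z_2)^{-1} + \cdots + (z - z_n)^{-1} + \phi'/\phi$,

where

    $\phi'/\phi = b_1 + b_2 z + b_3 z^2 + \cdots$,

then the series $\phi'/\phi$ has a radius of convergence at least $\rho'$. Hence for
some $\gamma$

(3.12.7)    $|b_n| < \gamma |\rho^{-n}|$.

On integrating (3.12.6) from $0$ to $z$ and taking the antilogarithm,

(3.12.8)    $f(z) = (1 - z/z_1) \cdots (1 - z/z_n) \exp (b_1 z + b_2 z^2/2 + \cdots)$,

(3.12.9)    $f(z) f(\omega z) \cdots f(\omega^{m-1} z) = (1 - z^m/z_1^m) \cdots (1 - z^m/z_n^m)(1 + B_{1,m} z^m + B_{2,m} z^{2m} + \cdots)$,

where

    $1 + B_{1,m} z^m + B_{2,m} z^{2m} + \cdots = \exp (b_m z^m + b_{2m} z^{2m}/2 + \cdots)$.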

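To express the coefficients $B_{p,m}$ in terms of the $b_{pm}$, consider the related
problem

    $1 + A_1 y + A_2 y^2 + \cdots = \exp (\alpha_1 y + \alpha_2 y^2/2 + \cdots)$.

By differentiating both sides with respect to $y$, we get

    $A_1 + 2A_2 y + 3A_3 y^2 + \cdots = (\alpha_1 + \alpha_2 y + \alpha_3 y^2 + \cdots)(1 + A_1 y + A_2 y^2 + \cdots)$.

Hence on multiplying and comparing coefficients, we obtain the recursion

    $A_1 = \alpha_1$,  $2A_2 = \alpha_2 + \alpha_1 A_1$,  $3A_3 = \alpha_3 + \alpha_2 A_1 + \alpha_1 A_2$,  . . . .

Hence

    $B_{1,m} = b_m$,  $2B_{2,m} = b_{2m} + b_m B_{1,m}$,  . . . .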
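The recursion obtained by differentiating the exponential can be checked on small cases. The following sketch assumes the form $k A_k = \alpha_k + \alpha_{k-1} A_1 + \cdots + \alpha_1 A_{k-1}$ with $A_0 = 1$, which is what comparison of coefficients gives for $1 + A_1 y + A_2 y^2 + \cdots = \exp(\alpha_1 y + \alpha_2 y^2/2 + \cdots)$; the function name `exp_series_coeffs` is hypothetical, and exact rational arithmetic is used only to make the check transparent.

```python
from fractions import Fraction

def exp_series_coeffs(alpha, count):
    """Return [A_1, ..., A_count]; alpha[j-1] holds alpha_j (missing ones are 0)."""
    A = [Fraction(1)]                      # A_0 = 1
    for k in range(1, count + 1):
        s = sum(Fraction(alpha[j - 1]) * A[k - j]
                for j in range(1, k + 1) if j <= len(alpha))
        A.append(Fraction(s) / k)          # k A_k = sum_j alpha_j A_{k-j}
    return A[1:]

# alpha_1 = 1, others 0: the exponent is y, so A_k = 1/k!.
print(exp_series_coeffs([1], 4))
# alpha_j = 1 for all j: the exponent is -log(1 - y), so every A_k = 1.
print(exp_series_coeffs([1, 1, 1, 1], 4))
```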



It follows immediately from the first of these relations and (3.12.7) that

    $|B_{1,m}| < \gamma |\rho^{-m}|$,

and if $\gamma > 1$, as we may require, one can show inductively that

    $|B_{p,m}| < \gamma^p |\rho^{-pm}|$.

Now from (3.12.4) and (3.12.9)

(3.12.10)    $1 + a_{1,m} z^m + a_{2,m} z^{2m} + \cdots = (1 - z^m/z_1^m) \cdots (1 - z^m/z_n^m)(1 + B_{1,m} z^m + B_{2,m} z^{2m} + \cdots)$.



Hence, by comparing coefficients of z nm , 



But 




ntn 



iyjn <9?ft| ^^ *\/m\'y / /*p 
."*! 2 n | 2i 7 |*n/P| 

For fixed $n$, as $m$ increases, the first term vanishes as $|z_n/\rho|^m$, the
second as $|z_n/\rho|^{2m}$, . . . . Hence

(3.12.11)    $z_1^m z_2^m \cdots z_n^m a_{n,m} = (-1)^n + O(|z_n/\rho|^m)$.

This is the required theorem.

3.2. Bernoulli's Method. The Graeffe process has the decided advantage
that the exponents $m = 2^p$ themselves build up exponentially.
Hence if one is fortunate enough to have the roots reasonably well
separated in the original equation, he may hope that only a relatively 
small number of root squarings will be required. It has the further 
advantage that, in principle at least, it yields simultaneously all the 
roots of an algebraic equation and all roots within a circle of analyticity 
when the equation is transcendental. Hence, once the squaring has been 
carried sufficiently far, all solutions are obtainable by simple division or, 
at worst, by root extraction. 

The methods now to be described do not converge nearly so fast; they
give only one root, or at most a few roots, at a time; and in some cases 
they require some previous knowledge of the approximate locations of the 
roots to be determined. Nevertheless, they all have one striking advan- 
tage. Errors, once made, do not propagate but tend to die out. If 
there were no round-off error, they would die out completely. A gross 
error might cause the process to converge to some root other than the 
one intended, however. But the self-correcting tendency suggests that a 
method of this type might be useful at least for improving approximate 
solutions obtained by, say, Graeffe's method, but with insufficient 
accuracy. 

Let the equation

(3.2.1)    $f(z) \equiv a_0 + a_1 z + a_2 z^2 + \cdots = 0$

have a single root $\alpha$ interior to some circle about the origin throughout
which $f(z)$ is analytic. Then if $g(z)$ is analytic throughout the same
circle, and $g(\alpha) \neq 0$, it follows from König's theorem that
$h_\nu/h_{\nu+1} \to \alpha$, where

(3.2.2)    $g/f = h_0 + h_1 z + h_2 z^2 + \cdots$.

Now if

(3.2.3)    $g(z) = g_0 + g_1 z + g_2 z^2 + \cdots$,

we can set

    $g_0 + g_1 z + \cdots = (a_0 + a_1 z + \cdots)(h_0 + h_1 z + \cdots)$

and compare coefficients to obtain the recursion

           $a_0 h_0 = g_0$,
(3.2.4)    $a_0 h_1 + a_1 h_0 = g_1$,
           $a_0 h_2 + a_1 h_1 + a_2 h_0 = g_2$,
           . . . . . . . . . .

so that, if $a_0 \neq 0$, the $h_\nu$ can be obtained in sequence.
In the case of an algebraic equation

    $P(x) \equiv a_0 x^n + a_1 x^{n-1} + \cdots + a_n = 0$,

the function $f(z) = z^n P(1/z) = a_0 + a_1 z + \cdots + a_n z^n$ is also a
polynomial, and the root $\alpha$ is the reciprocal of some root $x_i$ of $P = 0$.
If $g$ is taken to be some polynomial of degree less than $n$, then for
$\nu \ge n$

(3.2.5)    $a_0 h_\nu + a_1 h_{\nu-1} + \cdots + a_n h_{\nu-n} = 0$.

However, instead of first making a nearly arbitrary selection of $g(z)$, one
can just as well select $h_0, h_1, \ldots, h_{n-1}$ arbitrarily and apply only
Eq. (3.2.5). One never needs to know the function $g$ explicitly. It
might happen, by chance, that the selection of $h_0, \ldots, h_{n-1}$ defines by
(3.2.4) a polynomial $g$ which vanishes at $\alpha$, but if so, the sequence
$h_\nu/h_{\nu+1}$ may converge to some other root.
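A minimal sketch of the recurrence (3.2.5) follows. It assumes a real polynomial with a single root of largest modulus; the function name `bernoulli_largest_root`, the starting values $h_0 = \cdots = h_{n-2} = 0$, $h_{n-1} = 1$, and the test cubic are illustrative choices, not prescriptions. The quotient returned is $h_\nu/h_{\nu-1}$, the reciprocal of the quantity that tends to $\alpha$, so that it approximates the root $x_1$ of $P = 0$ directly.

```python
def bernoulli_largest_root(a, steps, start=None):
    """Apply a_0 h_v + a_1 h_{v-1} + ... + a_n h_{v-n} = 0 and return h_v / h_{v-1}.

    a lists a_0, ..., a_n for P(x) = a_0 x^n + ... + a_n; start gives h_0..h_{n-1}.
    """
    n = len(a) - 1
    h = list(start) if start is not None else [0.0] * (n - 1) + [1.0]
    for _ in range(steps):
        # h[-j] is h_{v-j} at the moment h_v is being formed
        h.append(-sum(a[j] * h[-j] for j in range(1, n + 1)) / a[0])
    return h[-1] / h[-2]

# Assumed example: (x - 1)(x - 2)(x - 3); the errors shrink roughly like (2/3)^v.
print(bernoulli_largest_root([1.0, -6.0, 11.0, -6.0], 80))
```

For long runs the $h_\nu$ grow like $x_1^\nu$, so a practical routine would rescale them from time to time; the sketch omits this.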

Bernoulli's method, properly speaking, is the method just described
for an algebraic equation, though the usual derivation is somewhat
different. If the roots $x_i$ of the algebraic equation $P = 0$ are all distinct,
then for any choice of $h_0, \ldots, h_{n-1}$, the $n$ equations

    $\Sigma u_i x_i^p = h_p$,    $p = 0, \ldots, n - 1$

can be solved for the $u_i$, since the determinant $|x_i^p|$ is a Vandermonde
which vanishes only when two or more of the $x_i$ are equal. If now
$h_n, h_{n+1}, \ldots$ are determined by (3.2.5), then

    $h_\nu = \Sigma u_i x_i^\nu$

for every $\nu$. But if $x_1$, say, is the largest root, and $u_1 \neq 0$, then

    $h_\nu = u_1 x_1^\nu [1 + (u_2/u_1)(x_2/x_1)^\nu + \cdots]$,

and for $\nu$ sufficiently large, $h_\nu \approx u_1 x_1^\nu$, and hence
$h_\nu/h_{\nu+1} \approx x_1^{-1} = \alpha$.

To return to the general case where $f$ may or may not be a polynomial,
there may not be a circle about the origin containing only a single root.
It may be, instead, that there are two conjugate complex roots $\alpha_1$ and
$\alpha_2$ for which therefore $|\alpha_1| = |\alpha_2|$, and which, however, lie within
some circle which contains no other root. If so, we can apply the extension
of König's theorem, computing the $h_\nu$ as before, but forming a quadratic
equation (3.08.13), for $\nu$ sufficiently large, whose roots will be $\alpha_1$ and
$\alpha_2$ approximately. More generally, if there are $n$ distinct roots with
equal moduli, (3.08.14) can be applied. However, it may be preferable to
set $z = y + u$ for some fixed $u$, and apply the method to the resulting
equation in $y$.
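For a conjugate pair the quadratic can be formed directly from four consecutive $h_\nu$. The sketch below assumes that the determinant in question is the $n = 2$ case of (3.08.14), with rows $(h_\nu, h_{\nu-1}, z^2)$, $(h_{\nu+1}, h_\nu, z)$, $(h_{\nu+2}, h_{\nu+1}, 1)$, expanded along its last column; the function name `dominant_pair` and the test cubic $(x^2 - 2x + 5)(x - 1)$ are hypothetical. The quadratic is solved for $x = 1/z$, so the values returned approximate the two roots of largest modulus of $P = 0$.

```python
import cmath

def dominant_pair(a, steps):
    """Approximate the two largest-modulus roots of a_0 x^n + ... + a_n = 0."""
    n = len(a) - 1
    h = [0.0] * (n - 1) + [1.0]
    for _ in range(steps):
        h.append(-sum(a[j] * h[-j] for j in range(1, n + 1)) / a[0])
    hm, h0, h1, h2 = h[-4], h[-3], h[-2], h[-1]   # h_{v-1}, h_v, h_{v+1}, h_{v+2}
    c0 = h1 * h1 - h0 * h2     # cofactor of z^2
    c1 = h0 * h1 - hm * h2     # minor attached to z (enters with a minus sign)
    c2 = h0 * h0 - hm * h1     # cofactor of 1
    # Determinant: c0 z^2 - c1 z + c2 = 0; with x = 1/z: c2 x^2 - c1 x + c0 = 0.
    disc = cmath.sqrt(c1 * c1 - 4.0 * c2 * c0)
    return (c1 + disc) / (2.0 * c2), (c1 - disc) / (2.0 * c2)

# Assumed example: x^3 - 3x^2 + 7x - 5 with roots 1 + 2i, 1 - 2i, and 1.
pair = dominant_pair([1.0, -3.0, 7.0, -5.0], 60)
print(pair)
```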

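If there is some circle about the origin which contains only the root
$\alpha_1$, a somewhat larger one which contains only $\alpha_1$ and $\alpha_2$, a still
larger one containing only $\alpha_1$, $\alpha_2$, and $\alpha_3$, . . . , then in principle
one can obtain all roots $\alpha_1, \alpha_2, \alpha_3, \ldots$ without a change of origin.
Thus having found $\alpha_1$, we can apply the extension of König's theorem, and
on setting

(3.2.6)    $H_\nu^{(2)} = \begin{vmatrix} h_\nu & h_{\nu-1} \\ h_{\nu+1} & h_\nu \end{vmatrix}$

obtain

    $\alpha_1 \alpha_2 = \lim_{\nu \to \infty} H_\nu^{(2)}/H_{\nu+1}^{(2)}$.

Again if

(3.2.7)    $H_\nu^{(3)} = \begin{vmatrix} h_\nu & h_{\nu-1} & h_{\nu-2} \\ h_{\nu+1} & h_\nu & h_{\nu-1} \\ h_{\nu+2} & h_{\nu+1} & h_\nu \end{vmatrix}$,

then

    $\alpha_1 \alpha_2 \alpha_3 = \lim_{\nu \to \infty} H_\nu^{(3)}/H_{\nu+1}^{(3)}$.

Aitken has given a convenient recursion for calculating the determinants
$H_\nu^{(p)}$ of successively higher order. The formula is

(3.2.8)    $H_\nu^{(p+1)} H_\nu^{(p-1)} = (H_\nu^{(p)})^2 - H_{\nu+1}^{(p)} H_{\nu-1}^{(p)}$,    $p = 1, 2, \ldots$,

where for uniformity we set

    $H_\nu^{(0)} = 1$,    $H_\nu^{(1)} = h_\nu$.

For $p = 1$ the equation merely gives the expansion of the determinant
(3.2.6). The proof is sufficiently well illustrated for $p = 2$ and is based
upon a classical determinantal identity. The sixth-order determinants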
    [two sixth-order determinants with elements $h_\nu$, 0, and 1; the display is not legible in this copy]



are equal, since the second is obtainable from the first by subtracting the 
fourth, fifth, and sixth rows, respectively, from the first, second, and third. 
But the second one vanishes, since in the Laplace expansion by third-order 
minors every term contains a determinant with a column of zeros. The 
Laplace expansion of the first along the first three rows has six nonvanish- 
ing terms, but these are equal in pairs. When the three distinct terms 
are written out, one obtains 



    [sum of three products of determinants in the $h_\nu$; the display is not legible in this copy] $= 0$.



On simplifying and rearranging, we obtain (3.2.8) for $p = 2$. The general
case requires the expansion of a vanishing determinant of order
$2p + 2$ formed in a similar manner.

Aitken first proposed his $\delta^2$ process, described below, as a device for
accelerating the convergence of the sequence $H_\nu^{(p)}/H_{\nu+1}^{(p)}$. If the
sequence

    $u_0, u_1, u_2, \ldots$

converges geometrically to the limit $u$, that is to say, if for some $k$,
$|k| < 1$, it is true that

    $u_\nu - u = k^\nu (u_0 - u)$,

then for any $\nu$

    $\begin{vmatrix} u_{\nu-1} & u_\nu \\ u_\nu & u_{\nu+1} \end{vmatrix} / (u_{\nu-1} - 2u_\nu + u_{\nu+1}) = u$.

The proof can be based upon the further property that, if

    $u' = u + \omega$,    $u'_{\nu-1} = u_{\nu-1} + \omega$,    $u'_\nu = u_\nu + \omega$,    $u'_{\nu+1} = u_{\nu+1} + \omega$,

the same identity holds when the quantities are primed. In fact, by
direct substitution one finds that, when each term in the sequence is
increased by $\omega$, the entire quantity on the left is increased by $\omega$. We
can therefore take $\omega = -u$ and consider the sequence whose limit is 0.
But by direct substitution, then, the determinant is seen to vanish,
which proves the assertion. In §3.08 it was shown that each sequence

    $u_\nu = H_\nu^{(p)}/H_{\nu+1}^{(p)}$

for fixed $p$ converges geometrically (in the limit). Hence we may expect
that the derived sequence

    $u_\nu^{(1)} = \begin{vmatrix} u_{\nu-1} & u_\nu \\ u_\nu & u_{\nu+1} \end{vmatrix} / (u_{\nu-1} - 2u_\nu + u_{\nu+1})$

would converge more rapidly than the original one. This is Aitken's $\delta^2$
process. A second derived sequence, $u_\nu^{(2)}$, can be formed from the
$u_\nu^{(1)}$ just as the $u_\nu^{(1)}$ was formed from the $u_\nu$. It is to be noted that
in forming a term in a derived sequence one can neglect all digits on the
left that are common to the three terms being used. This is because of the
property that in increasing each term by $\omega$ the result is increased by $\omega$.
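The $\delta^2$ formula is easily tried on a sequence that is exactly geometric, for which every derived term should reproduce the limit. The sketch below uses the determinant form (product of the outer terms less the square of the middle term, divided by the second difference); the function name `aitken` and the test sequence $u_\nu = 2 + 3 \cdot (1/2)^\nu$ are assumptions for illustration.

```python
def aitken(u):
    """Derived sequence: one term for each interior index of u."""
    out = []
    for i in range(1, len(u) - 1):
        second_difference = u[i - 1] - 2.0 * u[i] + u[i + 1]
        out.append((u[i - 1] * u[i + 1] - u[i] ** 2) / second_difference)
    return out

seq = [2.0 + 3.0 * 0.5 ** v for v in range(6)]   # converges geometrically to 2
derived = aitken(seq)
print(derived)
```

When the three terms agree in many leading digits both numerator and denominator suffer cancellation, which is one reason for the remark above about discarding the common digits before forming the quotient.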

We conclude this section with a brief mention of an expansion due to
Whittaker. In (3.2.4) we are free to take $g_0 = a_0$, $g_1 = g_2 = \cdots = 0$.
If the first $\nu + 1$ of these equations are regarded as $\nu + 1$ linear equations
in the $\nu + 1$ unknowns $h_0, h_1, \ldots, h_\nu$, the solution for $h_\nu$ can be written
down in determinantal form (cf. §3.32 below). Hence the ratio $h_\nu/h_{\nu+1}$
can be expressed as the ratio of two determinants. Moreover, one can
write

    $\alpha = h_0/h_1 + (h_1/h_2 - h_0/h_1) + (h_2/h_3 - h_1/h_2) + \cdots$,

and therefore $\alpha$ can be written as the limit of an infinite series involving
quotients of determinants. A slight transformation yields

    $\alpha = h_0/h_1 + (h_1^2 - h_0 h_2)/h_1 h_2 + (h_2^2 - h_1 h_3)/h_2 h_3 + \cdots$,

and the numerators in these fractions are second-order determinants
whose elements are themselves determinants. One can now apply an
identity of the same type as the one used to demonstrate (3.2.8) and
obtain Whittaker's expansion:

(3.2.9)    $\alpha = -\dfrac{a_0}{a_1} - \dfrac{a_0^2 a_2}{a_1 \begin{vmatrix} a_1 & a_2 \\ a_0 & a_1 \end{vmatrix}} - \dfrac{a_0^3 \begin{vmatrix} a_2 & a_3 \\ a_1 & a_2 \end{vmatrix}}{\begin{vmatrix} a_1 & a_2 \\ a_0 & a_1 \end{vmatrix} \begin{vmatrix} a_1 & a_2 & a_3 \\ a_0 & a_1 & a_2 \\ 0 & a_0 & a_1 \end{vmatrix}} - \cdots$.



3.3. Functional Iteration. If $\psi(x)$ has no pole which coincides with a
root of

(3.3.1)    $f(x) = 0$,

and if

(3.3.2)    $\phi(x) \equiv x - \psi(x) f(x)$,

then any root of (3.3.1) satisfies also

(3.3.3)    $x = \phi(x)$.

In particular if $\psi(x)$ is analytic and non-null throughout some
neighborhood of a root $\alpha$ of (3.3.1), then $\alpha$ is the only root of (3.3.3) in
that neighborhood of $\alpha$. This suggests the possibility of so choosing $\psi$
that the sequence

(3.3.4)    $x_{i+1} = \phi(x_i)$

converges to $\alpha$, provided the initial point $x_0$ is sufficiently close to $\alpha$.
In fact if the sequence (3.3.4) converges at all, it must converge to a
solution of (3.3.3), since clearly the sequences $x_i$ and $\phi(x_i)$ have the same
limit. Consider first the conditions upon $\phi$ that will ensure the
convergence of (3.3.4).

Define the $\rho$ neighborhood $N(x_0, \rho)$ of $x_0$ by

(3.3.5)    $N(x_0, \rho)$:  $|x - x_0| < \rho$.

That is, the $\rho$ neighborhood of $x_0$ is the set of all points $x$ within a
distance of $\rho$ from $x_0$. Now if for some positive $k < 1$ and some $\rho$ it is
true that

(3.3.6)    $|\phi(x') - \phi(x'')| < k|x' - x''|$    for $x'$ and $x''$ in $N(\alpha, \rho)$,

and if $x_0$ itself is in $N(\alpha, \rho)$, then every $x_i$ in the sequence (3.3.4) is in
$N(\alpha, \rho)$, and the sequence converges to $\alpha$. For

    $x_{i+1} - \alpha = \phi(x_i) - \phi(\alpha)$,

so that

    $|x_{i+1} - \alpha| = |\phi(x_i) - \phi(\alpha)| < k|x_i - \alpha|$,

and inductively every $x_i$ lies in $N(\alpha, \rho)$. Also

(3.3.7)    $|x_i - \alpha| < k^i |x_0 - \alpha|$

by an induction that can be carried out once we know that every $x_i$ lies
in $N(\alpha, \rho)$. Since $k < 1$, therefore, the distance $|x_i - \alpha|$ decreases
geometrically at least.

Now suppose that for some $x_0$ and $\rho$ and a positive $k < 1$ we have

(3.3.8)    $|\phi(x') - \phi(x'')| < k|x' - x''|$    for $x'$, $x''$ in $N(x_0, \rho)$,

the condition holding in a $\rho$ neighborhood of $x_0$, while at $x_0$ we have

(3.3.9)    $|x_0 - \phi(x_0)| < (1 - k)\rho$.

We do not now presuppose the existence of a solution. Instead we show
that the sequence defined by (3.3.4) has a limit $\alpha$ which lies in $N(x_0, \rho)$
and which satisfies our equation. Hence the conditions (3.3.8) and
(3.3.9) are together sufficient to assure us of the existence of a solution
and that it is obtainable as the limit of our sequence.

We show first that every term in the sequence lies in $N(x_0, \rho)$. We are
assured by (3.3.9) that $x_1 = \phi(x_0)$ does, since this is equivalent to saying
that

    $|x_0 - x_1| < (1 - k)\rho < \rho$.

If also $x_2, x_3, \ldots, x_i$ all lie in $N(x_0, \rho)$, then since

    $|x_{i+1} - x_i| = |\phi(x_i) - \phi(x_{i-1})| < k|x_i - x_{i-1}|$,

and by induction

    $|x_{i+1} - x_i| < k^i |x_1 - x_0|$,

therefore

    $|x_{i+1} - x_0| \le |x_{i+1} - x_i| + |x_i - x_{i-1}| + \cdots + |x_1 - x_0|$
                    $< (k^i + k^{i-1} + \cdots + 1)(1 - k)\rho = (1 - k^{i+1})\rho < \rho$.

Hence the series

    $|x_0| + |x_1 - x_0| + |x_2 - x_1| + \cdots$

converges, and hence the series

    $x_0 + (x_1 - x_0) + (x_2 - x_1) + \cdots$

converges absolutely. But the partial sums of the last series are the $x_i$.
Hence the sequence (3.3.4) converges, and the limit therefore satisfies
(3.3.3).
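The iteration $x_{i+1} = \phi(x_i)$ is readily programmed. The sketch below is an illustration only: the function name `iterate`, the stopping rule (successive iterates agreeing within a tolerance), and the test function $\phi(x) = \cos x$ with $x_0 = 1$ are all assumptions. The conditions (3.3.8) and (3.3.9) are sufficient rather than necessary, and no claim is made that they hold for this $x_0$; the iteration converges here because the derivative of the cosine has modulus of about 0.67 near the limit.

```python
import math

def iterate(phi, x0, tol=1e-12, max_steps=200):
    """Return (x, steps) once |phi(x) - x| < tol; raise if that never happens."""
    x = x0
    for step in range(1, max_steps + 1):
        x_next = phi(x)
        if abs(x_next - x) < tol:
            return x_next, step
        x = x_next
    raise RuntimeError("no convergence within max_steps")

root, steps = iterate(math.cos, 1.0)
print(root, steps)
```

The number of steps reported reflects the geometric decrease of $|x_i - \alpha|$: each step gains only a fixed fraction of a decimal place.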

If $\phi$ is analytic in some neighborhood of a root $\alpha$, as will be assumed
throughout, and if

(3.3.10)    $|\phi'(\alpha)| < 1$,

then for any $k$ which satisfies

$|\phi'(\alpha)| < k < 1$

there exists a $\rho$ sufficiently small so that (3.3.6) will be satisfied. Hence
(3.3.10) is sufficient to ensure the existence of some neighborhood of $\alpha$,
though possibly small, within which (3.3.4) converges. However, it
appears from (3.3.7) that the convergence is more rapid for smaller $k$,
and hence for smaller $|\phi'(\alpha)|$. Hence it would be advantageous, if
possible, to make $\phi'(\alpha) = 0$. When this is the case, we shall say that the
sequence (3.3.4) has second-order convergence or, more briefly, that the
iteration (defined by) $\phi(x)$ is a second-order iteration. More generally,
if

(3.3.11)    $\phi'(\alpha) = \phi''(\alpha) = \cdots = \phi^{(m-1)}(\alpha) = 0$,

the iteration will be said to be of order $m$ at least, and of order $m$ exactly
if $\phi^{(m)}(\alpha) \ne 0$. While one can write in general the expansion

(3.3.12)    $\phi(x) - \alpha = (x - \alpha)\phi'(\alpha) + (x - \alpha)^2 \phi''(\alpha)/2! + \cdots$

(since $\phi$ is supposed analytic), when the iteration is of order $m$, one can
write

(3.3.13)    $\phi(x) - \alpha = (x - \alpha)^m \phi^{(m)}(\alpha)/m! + \cdots$.

In this case

$x_{i+1} - \alpha = (x_i - \alpha)^m \phi^{(m)}(\alpha)/m! + \cdots$.
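
The geometric decrease of the deviations for a first-order iteration can be observed numerically. The following is a minimal sketch in Python, not part of the original text; the helper name `iterate` and the choice $\phi(x) = \cos x$ are illustrative assumptions. It forms the sequence (3.3.4) and the ratios of successive deviations, which should approach $|\phi'(\alpha)| = \sin\alpha$.

```python
import math

def iterate(phi, x0, steps):
    """Return [x_0, x_1, ..., x_steps] with x_{i+1} = phi(x_i), as in (3.3.4)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(phi(xs[-1]))
    return xs

# Illustrative first-order iteration: phi(x) = cos x, for which |phi'(alpha)| < 1.
xs = iterate(math.cos, 1.0, 60)
alpha = xs[-1]                      # taken as the limit, to working precision

# Deviations |x_i - alpha| and the ratios of successive deviations.
errors = [abs(x - alpha) for x in xs[:20]]
ratios = [errors[i + 1] / errors[i] for i in range(5, 15)]

for r in ratios:
    print(round(r, 4), "compared with", round(math.sin(alpha), 4))
```

For an iteration of order $m > 1$ the same ratios would tend to zero, in agreement with (3.3.13).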

3.31. Some Special Cases. Some classical methods of successive 
approximation are methods of functional iteration as here understood. 
Horner's method is not, and since it has little to recommend it in any 
case, it will not be described here. 

3.311. First-order iterations. Best known of these is the regula falsi. 
This applies to real roots of real equations, algebraic or not. If $f(x')$ and
$f(x'')$ have opposite signs,

$x_2 = [x' f(x'') - x'' f(x')]/[f(x'') - f(x')]$

lies between $x'$ and $x''$. The chord from the point $(x', f(x'))$ to $(x'', f(x''))$
intersects the $x$ axis at $x_2$, as one verifies easily. It is no restriction to
suppose that $f(x_2)f(x'') > 0$, since otherwise we can reverse the designations
of $x'$ and $x''$. We now let $x_2$ play the role of $x''$ and repeat, or we
let $x'' = x_1$ and regard this as the first step in the iteration. Hence we
are taking

(3.311.1)    $\phi(x) = [x' f(x) - x f(x')]/[f(x) - f(x')]$.

The derivative $\phi'(\alpha)$ is seen to be

$\phi'(\alpha) = [f(x') + (\alpha - x') f'(\alpha)]/f(x')$.

In case $f$ has continuous first and second derivatives near $\alpha$, then the
Taylor expansion gives

$f(x') = f(\alpha) + (x' - \alpha) f'(\alpha) + \tfrac{1}{2}(x' - \alpha)^2 f''(x'')$,

where now $x''$ is some point on the interval $(\alpha, x')$. Hence

$f(x') + (\alpha - x') f'(\alpha) = \tfrac{1}{2}(x' - \alpha)^2 f''(x'')$,

and therefore

$\phi'(\alpha) = \tfrac{1}{2}(x' - \alpha)^2 f''(x'')/f(x')$.

Hence for $x'$ sufficiently close to $\alpha$, $\phi'(\alpha)$ will be small, and there will exist
an interval about $\alpha$ over which

$|\phi'(x)| < k < 1$.

Hence once we find an initial $x'$ close enough to $\alpha$, the subsequent iteration
will converge to the solution. The choice (3.311.1) of $\phi$ is equivalent
to the choice

$\psi = (x - x')/[f(x) - f(x')]$

in (3.3.2), as we verify directly.
Geometrically this method amounts to drawing a series of chords
all passing through the point $(x', f(x'))$. The $i$th chord passes also through
$(x_{i-1}, f(x_{i-1}))$. Instead we might take a series of parallel chords. For
this

$\psi \equiv m$,

where $m$ is the constant slope. Then

$\phi(x) = x - m f(x)$,

and

$\phi'(x) \equiv 1 - m f'(x)$.

If $f'(x)$ has fixed sign throughout some neighborhood of the solution, we
choose $m$ to have the same sign and, in fact, such that throughout the
interval

$2 > m f'(x) > 0$.
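
The two first-order iterations just described can be sketched as follows. This is an illustrative Python sketch, not part of the original text; the function names, the test equation $x^2 - 2 = 0$, the fixed point $x' = 2$, and the constant $m = 0.25$ are all assumptions chosen so that $2 > m f'(x) > 0$ holds near the root.

```python
def fixed_chord_iteration(f, x_fixed, x, steps):
    """Iterate (3.311.1): every chord passes through (x_fixed, f(x_fixed))."""
    f_fixed = f(x_fixed)
    for _ in range(steps):
        fx = f(x)
        if fx == f_fixed:          # degenerate chord; stop rather than divide by zero
            break
        x = (x_fixed * fx - x * f_fixed) / (fx - f_fixed)
    return x

def parallel_chord_iteration(f, m, x, steps):
    """Iterate phi(x) = x - m f(x): all chords have the same direction."""
    for _ in range(steps):
        x = x - m * f(x)
    return x

f = lambda x: x * x - 2.0          # illustrative equation with root sqrt(2)

root_fixed = fixed_chord_iteration(f, 2.0, 1.0, 60)
root_parallel = parallel_chord_iteration(f, 0.25, 1.0, 60)
print(root_fixed, root_parallel)
```

Both sequences converge only linearly, the rate being governed by $|\phi'(\alpha)|$ in each case.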

3.312. Second-order iterations. If we take

(3.312.1)    $\phi(x) = x - f(x)/f'(x)$,

which is to say

(3.312.2)    $\psi(x) = 1/f'(x)$,

we obtain the well-known Newton's method. The derivative is

(3.312.3)    $\phi'(x) = 1 - [f'^2(x) - f(x)f''(x)]/f'^2(x)$,

whence

$\phi'(\alpha) = 0$.

If $\alpha$ is not a multiple root,

$f'(\alpha) \ne 0$,

and for any positive $k$ there is a neighborhood of $\alpha$ throughout which

$|\phi'(x)| < k$.

The requirement often made that at the initial approximation $x_0$ we
should have

$f(x_0)f''(x_0) > 0$

is not strictly necessary.

Newton's method applies to transcendental as well as to algebraic
equations, and to complex roots as well as to real. However, if the equation
is real, then the complex roots occur in conjugate pairs, and the
iteration cannot converge to a complex root unless $x_0$ is itself complex.
But if $x_0$ is complex, and sufficiently close to a complex root, the iteration
will converge to that root.
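
The remark about real and complex starting values can be illustrated by a short sketch. The following Python fragment is not from the original text; the function name `newton`, the equation $z^3 - 1 = 0$, and the two starting values are illustrative assumptions.

```python
import cmath

def newton(f, fprime, z, steps):
    """Newton's iteration phi(z) = z - f(z)/f'(z), valid for real or complex z."""
    for _ in range(steps):
        z = z - f(z) / fprime(z)
    return z

f = lambda z: z ** 3 - 1           # real equation with one real and two complex roots
fprime = lambda z: 3 * z ** 2

# A real start produces only real iterates and so can reach only the real root.
real_limit = newton(f, fprime, 2.0, 30)

# A complex start near a complex root converges to that root.
complex_limit = newton(f, fprime, complex(-0.4, 0.9), 30)

print(real_limit, complex_limit)
```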

For algebraic equations, as each of the first two or three $x_i$ is obtained, it
is customary to diminish the roots by $x_i$ by the process described in 3.04.
Or rather, one first diminishes by $x_0$; then one obtains $x_1 - x_0$ and
diminishes the roots of the last equation by this amount; then obtains
$x_2 - x_1$ and diminishes by this amount, etc. Since

$f(x_i + u) = f(x_i) + u f'(x_i) + \cdots$,

one has always that $x_{i+1} - x_i$ is the negative quotient of the constant
term by the coefficient of the linear term. Hence one calculates $f(x_i)$ and
$f'(x_i)$ in the process of diminishing the roots at each stage.

However, $f(x_i)$ decreases as one proceeds. When $f(x_i)/f'(x_i)$ becomes
sufficiently small, one can write

$u = -[f(x_i) + u^2 f''(x_i)/2! + \cdots]/f'(x_i)$,

which is exact. When $u$ is small, the terms in $u^2, u^3, \ldots$ become small
correction terms, and subsequent improvements in the value of $u$ can be
made quite rapidly by resubstituting corrected values.

When Newton's method is applied to the equation

$x^2 - N = 0$,

the result is a standard method for extracting roots in which one uses

$\phi(x) = (x + N/x)/2$.

3.32. Iterations of Higher Order: König's Theorem. If one applies
Newton's method to any product $q(x)f(x)$ to obtain a particular zero $\alpha$ of
$f(x)$, one always obtains an iteration of second order at least, provided
only $q(\alpha)$ is neither zero nor infinite. Hence one might expect that by
proper choice of q it should be possible to obtain an iteration of third 
order or even higher. This is true; one can in fact obtain an iteration of 
any desired order, and various schemes have been devised for the purpose. 
Some of these will now be described. However, one must expect that 
an iteration of higher order is apt to require more laborious computations. 
The optimal compromise between simplicity of algorithm and rapidity 
of convergence will depend in large measure upon the nature of the avail- 
able computing facilities. 

Consider first König's theorem. In the expansion

$h(z) \equiv g(z)/f(z) = h_0 + h_1 z + h_2 z^2 + \cdots$

we have Taylor's expansion about the origin in which

$h_r = h^{(r)}(0)/r!$.

If we move the origin to some point $x$, we can restate the theorem in an
apparently more general form, as follows:
If in some circle about $x$ the equation

(3.32.1)    $f(z) = 0$

has only a single root $\alpha$, which is simple; if $f(z)$ and $g(z)$ are analytic
throughout this circle, and $g(\alpha) \ne 0$; and if we define

(3.32.2)    $P_r(x) \equiv h_{r-1}(x)/h_r(x)$, $h_r(x) \equiv h^{(r)}(x)/r!$,

then

(3.32.3)    $\lim_{r \to \infty} P_r(x) = \alpha - x$.

For $P_r(x)$ is simply the ratio of the coefficients of the Taylor expansion of
$h(z)$ about the point $x$.

This being true, then at least for $r$ sufficiently large it is to be expected
that

$|\alpha - x - P_r(x)| < |\alpha - x|$,

and hence $x + P_r$ should then define a convergent iteration of some order.
It turns out that the iteration is convergent for any $r$ and in fact is of
order $r + 1$.

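In proof, we write $h(z) = F(z)/(\alpha - z)$, where $F(z)$ is analytic in the
circle, together with the expansions

$h(z) \equiv h_0(x) + (z - x)h_1(x) + (z - x)^2 h_2(x) + \cdots$

and

$F(z) = k_0(x) + (z - x)k_1(x) + \cdots + (z - x)^r k_r(x) + R_{r+1}(z, x)$,

so that $R_r(z, x) = (z - x)^r k_r(x) + R_{r+1}(z, x)$. Then

$P_r(x) = h_{r-1}(x)/h_r(x)$
$\qquad = (\alpha - x)[F(\alpha) - R_r(\alpha, x)]/[F(\alpha) - R_{r+1}(\alpha, x)]$

and

$\alpha - x - P_r = (\alpha - x)^{r+1} k_r(x)/[F(\alpha) - R_{r+1}(\alpha, x)]$.

Since $\alpha - x - P_r(x)$ has the factor $(\alpha - x)^{r+1}$, all derivatives of

$\alpha - x - P_r$,

and hence of $x + P_r(x)$, from the first to the $r$th will vanish at $x = \alpha$.
By definition, therefore, $x + P_r(x)$ defines an iteration of order $r + 1$
at least. Note that with $g \equiv 1$, $P_1 = -f/f'$, which yields Newton's
method.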

For the functions $h_\nu(x)$ required in forming any $P_r(x)$, one can obtain
them by differentiation, as indicated in the theorem, or by solving a
recursion like that of (3.2.4). However, now the $a_\nu$ and $g_\nu$ are functions
of $x$, coefficients in the expansions

$f(z) \equiv a_0(x) + (z - x)a_1(x) + \cdots$,
$g(z) \equiv g_0(x) + (z - x)g_1(x) + \cdots$.

In the statement of Whittaker's expansion (3.2), reference was made
to the fact that the $h_\nu$ could be expressed by means of determinants.
This is equally true for the $h_\nu(x)$, and even for the $P_\nu(x)$. For this it is
convenient to suppose that any desired function $g$ has been divided into $f$
in advance, and $f$ now designates the quotient. The expansion is now

(3.32.4)    $1/f(z) = h_0(x) + (z - x)h_1(x) + (z - x)^2 h_2(x) + \cdots$,

so that

$h_r(x) = [1/f(x)]^{(r)}/r!$

and

$P_r(x) = h_{r-1}(x)/h_r(x)$.

Let $\Delta_0 = 1$, $\Delta_1 = a_1(x)$, and in general

(3.32.5)    $\Delta_r = \begin{vmatrix} a_1 & a_0 & 0 & \cdots & 0 \\ a_2 & a_1 & a_0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ a_r & a_{r-1} & a_{r-2} & \cdots & a_1 \end{vmatrix}$.

Then

(3.32.6)    $h_r = (-1)^r \Delta_r / a_0^{r+1}$

and

(3.32.7)    $P_r = -a_0 \Delta_{r-1}/\Delta_r$.

This is equivalent to saying that an iteration

(3.32.8)    $\phi_r = x + P_r$

of order $r + 1$ is given by $\phi_r$ satisfying

(3.32.9)    $\begin{vmatrix} \phi_r - x & a_0 & 0 & \cdots & 0 \\ -1 & a_1 & a_0 & \cdots & 0 \\ 0 & a_2 & a_1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & a_r & a_{r-1} & \cdots & a_1 \end{vmatrix} = 0$.

As an example, let

$f = x^m - N$.

Then

$a_0 = x^m - N$, $a_1 = m x^{m-1}$, $a_2 = m(m - 1)x^{m-2}/2$.

We have for third-order iteration

$\phi = x - a_0 a_1/(a_1^2 - a_0 a_2)$,

which becomes after simplification

$\phi = x[(m - 1)x^m + (m + 1)N]/[(m + 1)x^m + (m - 1)N]$.

This defines Bailey's iteration for root extraction. For square roots

$\phi = x(x^2 + 3N)/(3x^2 + N)$.
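
The third-order iteration for $x^m - N = 0$ obtained from (3.32.7) with $r = 2$ can be checked numerically. The following Python sketch is illustrative only and not part of the original text; the function names and the sample values are assumptions. It compares the form $x - a_0 a_1/(a_1^2 - a_0 a_2)$ with its simplified quotient form and applies the latter to two roots.

```python
def bailey_step(x, number, m):
    """Simplified third-order step for the m-th root of `number`."""
    xm = x ** m
    return x * ((m - 1) * xm + (m + 1) * number) / ((m + 1) * xm + (m - 1) * number)

def third_order_step(x, number, m):
    """Unsimplified form x - a0 a1 / (a1^2 - a0 a2) for f = x^m - number."""
    a0 = x ** m - number
    a1 = m * x ** (m - 1)
    a2 = m * (m - 1) * x ** (m - 2) / 2.0
    return x - a0 * a1 / (a1 * a1 - a0 * a2)

def bailey_root(number, m, x, steps):
    """Apply the third-order step repeatedly from the starting value x."""
    for _ in range(steps):
        x = bailey_step(x, number, m)
    return x

sqrt_two = bailey_root(2.0, 2, 1.0, 5)     # first steps: 1 -> 1.4 -> 1.4142132...
cbrt_ten = bailey_root(10.0, 3, 2.0, 5)
print(sqrt_two, cbrt_ten)
```

Each step costs one division, as does Newton's step $(x + N/x)/2$ for the square root, but the number of correct digits is roughly tripled rather than doubled.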
3.33. Iterations of Higher Order: Aitken's $\delta^2$ Process. Consider any
sequence $x_i$ satisfying the iteration $\phi(x)$, with the limit $\alpha$. The sequence

(3.33.1)    $u_i = x_i - \beta$

for any fixed $\beta$ has the limit $\alpha - \beta$ and satisfies the iteration

(3.33.2)    $\psi(u) \equiv \phi(u + \beta) - \beta$,

since

$x_{i+1} = u_{i+1} + \beta = \phi(x_i) = \phi(u_i + \beta)$.

In particular if $\beta = \alpha$, the sequence $\xi_i = x_i - \alpha$ represents the deviations
of the approximations $x_i$ from the limit $\alpha$, and this sequence is defined by
the initial deviation $\xi_0$, together with the iteration

(3.33.3)    $\omega(\xi) \equiv \phi(\xi + \alpha) - \alpha$.

If the iteration is of order $r$, then

(3.33.4)    $\omega(\xi) = a_r \xi^r + a_{r+1} \xi^{r+1} + \cdots$, $a_r = \phi^{(r)}(\alpha)/r! \ne 0$.

Now let $\phi_{(1)}(x)$ and $\phi_{(2)}(x)$ represent any two iterations of the same
order and not necessarily distinct. With these can be associated the
iterations $\omega_{(1)}(\xi)$ and $\omega_{(2)}(\xi)$ satisfied by the deviations. The function

(3.33.5)    $\Phi(x) \equiv \{x\,\phi_{(1)}[\phi_{(2)}(x)] - \phi_{(1)}(x)\phi_{(2)}(x)\}/\{x - \phi_{(1)}(x) - \phi_{(2)}(x) + \phi_{(1)}[\phi_{(2)}(x)]\}$

also defines an iteration. This is invariant in the sense that, if one makes
the substitution (3.33.1), forms the $\psi_{(1)}$ and $\psi_{(2)}$ as in (3.33.2), defines $\Psi$
as in (3.33.5) with $\psi_{(1)}$, $\psi_{(2)}$, and $u$ replacing $\phi_{(1)}$, $\phi_{(2)}$, and $x$, then
$\Psi(u) \equiv \Phi(u + \beta) - \beta$. This can be verified by direct substitution.
Now it can be shown that, if the iterations $\phi_{(1)}$ and $\phi_{(2)}$ are of order $r$,
then the iteration $\Phi$ is of order greater than $r$, provided only

(3.33.6)    $[\phi'_{(1)}(\alpha) - 1][\phi'_{(2)}(\alpha) - 1] \ne 0$.

In fact if $r = 1$, and this condition (3.33.6) is satisfied, then $\Phi$ is of order
2 at least, and if $r > 1$, then $\Phi$ is of order $2r - 1$, and condition (3.33.6)
is necessarily satisfied.

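Because of the invariance, it is sufficient to consider

$\Omega(\xi) \equiv \{\xi\,\omega_{(1)}[\omega_{(2)}(\xi)] - \omega_{(1)}(\xi)\omega_{(2)}(\xi)\}/\{\xi - \omega_{(1)}(\xi) - \omega_{(2)}(\xi) + \omega_{(1)}[\omega_{(2)}(\xi)]\}$,

where the sequences of deviations satisfy

$\omega_{(1)}(\xi) = a_r^{(1)} \xi^r + \cdots$, $\omega_{(2)}(\xi) = a_r^{(2)} \xi^r + \cdots$.

Then

$\omega_{(1)}[\omega_{(2)}(\xi)] = a_r^{(1)} [a_r^{(2)}]^r \xi^{r^2} + \cdots$.

Hence if $r > 1$, the term of lowest degree in the numerator is the second
term, which is of degree $2r$, while the term of lowest degree in the denominator
is the term $\xi$, itself of degree 1. Hence an expansion of the fraction
in powers of $\xi$ will begin with a term of degree $2r - 1$, and this exceeds
$r$ when $r > 1$.

If $r = 1$, then to form the numerator we have

$\xi\,\omega_{(1)}[\omega_{(2)}(\xi)] = a_1^{(1)} a_1^{(2)} \xi^2 + \cdots$,
$\omega_{(1)}(\xi)\omega_{(2)}(\xi) = a_1^{(1)} a_1^{(2)} \xi^2 + \cdots$,

whence on subtraction the terms in $\xi^2$ drop out, and the term of lowest
degree in $\xi$ is of degree 3. For the denominator we have

$\xi - \omega_{(1)}(\xi) - \omega_{(2)}(\xi) + \omega_{(1)}[\omega_{(2)}(\xi)]$
$\qquad = (1 - a_1^{(1)} - a_1^{(2)} + a_1^{(1)} a_1^{(2)})\xi + \cdots$
$\qquad = (1 - a_1^{(1)})(1 - a_1^{(2)})\xi + \cdots$.

But

$a_1^{(1)} = \phi'_{(1)}(\alpha)$, $a_1^{(2)} = \phi'_{(2)}(\alpha)$,

so that, if (3.33.6) is satisfied, the expansion of the denominator begins
with a term in the first power of $\xi$. Hence the expansion of the fraction
begins with a term of degree 2 at least, and the iteration is therefore of
order 2 at least.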

Thus given two iterations of the same order, one can always form an
iteration of higher order. But it was nowhere required that $\phi_{(1)}$ and $\phi_{(2)}$
be distinct, so that from any single iteration one can form an iteration of
higher order. More than this, an iteration $\Phi$ of order $r > 1$ always
converges in some neighborhood about $\alpha$, whereas an iteration $\phi$ of order 1
need not converge and will not unless $|\phi'(\alpha)| < 1$, as we have seen.
Hence from any function $\phi$, analytic in the neighborhood of $\alpha$ and satisfying
$\phi(\alpha) = \alpha$, one can form an iteration which converges to $\alpha$ whether
that defined by $\phi$ converges or not.

In ordinary practical application it is not desirable to form $\Phi$ explicitly.
Instead, one can proceed as follows: One forms

$x_1 = \phi(x_0)$, $x_2 = \phi(x_1)$

in the usual manner. However, for $x_3$ one takes not $\phi(x_2)$ but $\Phi(x_0)$ by

$x_3 = (x_0 x_2 - x_1^2)/(x_0 - 2x_1 + x_2)$.

In terms of the difference operator $\Delta$, defined by

$\Delta x_i = x_{i+1} - x_i$, $\Delta^2 x_i = \Delta x_{i+1} - \Delta x_i$,

this can be written

$x_3 = x_0 - (\Delta x_0)^2/\Delta^2 x_0$,

and more generally one can take

$x_{3\nu+3} = x_{3\nu} - (\Delta x_{3\nu})^2/\Delta^2 x_{3\nu}$.

This form brings out the fact that in practical computation any sequence
of digits on the left that is identical for $x_{3\nu}$, $x_{3\nu+1}$, and $x_{3\nu+2}$ can be ignored,
since it drops out in forming the differences and is restored in subtracting
from $x_{3\nu}$.

Just as the iteration $\Phi$ was of order higher than $\phi$, so one can form from
$\Phi$ an iteration of still higher order. Thus having computed

$x_3 = \Phi(x_0)$, $x_6 = \Phi(x_3)$,

instead of computing $\phi(x_6)$, one could now form

$x_0 - (x_3 - x_0)^2/(x_6 - 2x_3 + x_0)$.

In principle, iterations of arbitrarily high order can be built up by proceeding
in this manner.
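
The practical procedure just described, two ordinary steps followed by one extrapolation, can be sketched as follows. This Python fragment is illustrative and not part of the original text; the function names and both sample iterations are assumptions. The second sample, $\phi(x) = 2 - x^2$ with $\alpha = 1$ and $\phi'(\alpha) = -2$, is one for which the plain iteration does not converge.

```python
import math

def aitken_cycle(phi, x0):
    """Form x1 = phi(x0), x2 = phi(x1), then x0 - (dx0)^2 / d2x0."""
    x1 = phi(x0)
    x2 = phi(x1)
    d1 = x1 - x0                   # first difference
    d2 = (x2 - x1) - d1            # second difference
    if d2 == 0.0:                  # differences exhausted at working precision
        return x2
    return x0 - d1 * d1 / d2

def aitken_iterate(phi, x, cycles):
    """Repeat the three-term cycle, i.e. iterate the function Phi."""
    for _ in range(cycles):
        x = aitken_cycle(phi, x)
    return x

limit_cos = aitken_iterate(math.cos, 1.0, 8)
limit_divergent = aitken_iterate(lambda x: 2.0 - x * x, 0.9, 8)
print(limit_cos, limit_divergent)
```

The guard on a vanishing second difference is a design choice of this sketch; it merely avoids a division by zero once the sequence has settled to working precision.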

3.34. Iterations of Higher Order: Schröder's Method. The oldest method
of obtaining iterations of arbitrary order seems to be that given by
Schröder. Consider any simple root $\alpha$ of

(3.34.1)    $f(x) = 0$.

In the neighborhood of $\alpha$ we can set

(3.34.2)    $y = f(x)$

and let

(3.34.3)    $x = h(y)$,

where

(3.34.4)    $x \equiv h[f(x)]$, $y \equiv f[h(y)]$,

identically. Hence

(3.34.5)    $\alpha = h(0)$.

If we expand

$X = h(Y)$

in powers of $(Y - y)$, regarding $x$ and $y$ as fixed, we have

$X = x + (Y - y)h_1(y) + (Y - y)^2 h_2(y) + \cdots$,

where

(3.34.6)    $h_r(y) = h^{(r)}(y)/r!$.

Hence

(3.34.7)    $\alpha = x - y h_1(y) + \cdots + (-y)^r h_r(y) + (-y)^{r+1} R_{r+1}(y)$.

Now the $h_r$ are functions of $y$, but $y$ is a function of $x$ by (3.34.2), so that
we may write

(3.34.8)    $h_r(y) = h_r[f(x)] \equiv b_r(x)$

with

(3.34.9)    $b_1 = 1/f'$, $b_r = b'_{r-1}/(r f')$.

Then (3.34.7) can be written

(3.34.10)    $\alpha = x - f b_1 + \cdots + (-f)^r b_r + (-f)^{r+1} R_{r+1}(f)$

identically. If we define $\phi_r$ by

(3.34.11)    $\alpha \equiv \phi_r(x) + (-f)^{r+1} R_{r+1}(f)$,

then we see that $\phi_r$ provides an iteration of order $r + 1$, since its $r$th
derivative must vanish with $f$. Again $\phi_1$ yields Newton's method.
The quantities $b_i(x)$ required for the iteration

(3.34.12)    $\phi_r(x) = x - f b_1 + \cdots + (-f)^r b_r$

can be formed by successive differentiation as in (3.34.9), where the prime
denotes differentiation with respect to $x$. It is also possible to write a
system of recursion relations which can be used in case the successive
derivatives of $f(x)$ are known or easily evaluated. If these are known,
we can write the expansion of $Y = f(X)$ in powers of $(X - x)$:

$Y - y = a_1(X - x) + a_2(X - x)^2 + a_3(X - x)^3 + \cdots$,

where $a_i(x) = f^{(i)}(x)/i!$. If this is substituted into the expansion of
$X - x$ given above, we obtain

$X - x \equiv b_1[a_1(X - x) + a_2(X - x)^2 + a_3(X - x)^3 + \cdots]$
$\qquad + b_2[a_1(X - x) + a_2(X - x)^2 + \cdots]^2 + \cdots$

after replacing the $h_r(y)$ by $b_r(x)$, as in (3.34.8). Now the two sides must
be equal identically so that

(3.34.13)
$a_1 b_1 = 1$,
$a_1^2 b_2 + a_2 b_1 = 0$,
$a_1^3 b_3 + 2a_1 a_2 b_2 + a_3 b_1 = 0$,
$\cdots$

This is the required recursion for expressing the $b_i(x)$ in terms of the
$a_i(x)$ and hence of the derivatives of $f$.
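
As an illustration of the recursion, the third-order iteration $\phi_2 = x - f b_1 + f^2 b_2$ requires only $b_1 = 1/a_1$ and $b_2 = -a_2 b_1/a_1^2$. The following Python sketch is not part of the original text; the function names and the sample equation $x^3 - 2x - 5 = 0$ are assumptions chosen for illustration.

```python
def schroder_step(f, d1, d2, x):
    """One step of phi_2 = x - f b1 + f^2 b2, with b1 and b2 from the recursion."""
    y = f(x)
    a1 = d1(x)                     # a_1 = f'(x)
    a2 = d2(x) / 2.0               # a_2 = f''(x)/2!
    b1 = 1.0 / a1                  # from a1 b1 = 1
    b2 = -a2 * b1 / (a1 * a1)      # from a1^2 b2 + a2 b1 = 0
    return x - y * b1 + y * y * b2

def schroder_iterate(f, d1, d2, x, steps):
    for _ in range(steps):
        x = schroder_step(f, d1, d2, x)
    return x

f = lambda x: x ** 3 - 2.0 * x - 5.0       # illustrative equation
d1 = lambda x: 3.0 * x ** 2 - 2.0          # f'
d2 = lambda x: 6.0 * x                     # f''

first = schroder_step(f, d1, d2, 2.0)      # 2 + 0.1 - 0.006 = 2.094
root = schroder_iterate(f, d1, d2, 2.0, 5)
print(first, root)
```

Dropping the term in $b_2$ reduces the step to Newton's method, in agreement with the remark that $\phi_1$ yields Newton's iteration.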

3.35. Iterations of Higher Order: Polynomials. Three distinct methods
have been given for forming iterations of higher order. The $\delta^2$ process
presupposes that some iteration is known and deduces from it an iteration
of higher order. The methods of Schröder and of Richmond start with
the equation $f = 0$ and form directly an iteration of any prescribed order,
provided only that the root $\alpha$ is a simple root and that $f$ is analytic, or at
least possesses derivatives of sufficiently high order, in the neighborhood
of $\alpha$. Clearly if $g(\alpha) \ne 0$ and $g$ is analytic in the same neighborhood, one
could apply either method to the equation $F = 0$, where $F = fg$ or where
$F \equiv f/g$. Thus there are a great many ways of forming an iteration of
any order, and doubtless many different iterations. In special cases it
might be desirable to impose upon the iterations special conditions other
than the order of convergence. In particular if $f$ is a polynomial, one
might ask that $\phi$ be a polynomial. This would be desirable in case
one is using a computing machine for which division is inconvenient; or
in operations with matrices, where direct inversion is to be avoided; or,
as Rademacher and Schoenberg have shown, in operations with Laurent's
series, where direct inversion is impossible.

We ask now whether, when $f$ is a polynomial, one can find a function $g$
such that, when Schröder's method is applied to the equation

(3.35.1)    $F = f/g = 0$,

an iteration $\phi_r$ will be a polynomial. Taking first $\phi_1$, we wish to choose $g$
so that

$(f/g)/(f/g)' = pf$,

where $p$ is a polynomial. This requires that

$pf' - p(g'/g)f = 1$.

But if $f = 0$ has only simple roots, one can always find polynomials $p$ and
$q$ such that

(3.35.2)    $pf' - qf = 1$

by the theorem of 3.06. Hence if $g$ satisfies

(3.35.3)    $g'/g = q/p$,

the requirement is satisfied for $\phi_1$.
Now suppose that the process yields

(3.35.4)    $\phi_r = x - f p_1 + f^2 p_2/2! - \cdots + (-f)^r p_r/r!$, $p_1 = p$,

and that all $p_i$, and hence all $\phi_i$, are polynomials for $i \le r$. To obtain
$\phi_{r+1}$ we must add to this a term

$(-f)^{r+1} p_{r+1}/(r + 1)! = (-F)^{r+1} B_{r+1}$,

where the $B$'s are obtained from $F$ as were the $b$'s from $f$ in (3.34.9).
Hence

$B_r = g^r p_r/r!$,

and

$B_{r+1} = B'_r/[(r + 1)F']$.

After minor manipulations and applications of (3.35.2) and (3.35.3), this
gives

$B_{r+1} = g^{r+1}(p p'_r + r q p_r)/(r + 1)!$,

and therefore

$F^{r+1} B_{r+1} = f^{r+1}(p p'_r + r q p_r)/(r + 1)!$.

Hence

(3.35.5)    $p_{r+1} = p p'_r + r q p_r$

is a polynomial, and this recursion, together with (3.35.4), defines polynomial
iterations of all orders. Note that $g$ itself is not required explicitly,
but only $p$ and $q$.
For illustration let

$f \equiv 1 - x^n a$, $f' \equiv -n x^{n-1} a$.

Then

$(-x/n)(-n x^{n-1} a) + (1 - x^n a) = 1$.

Hence

$p = -x/n$, $q = -1$,
$g'/g = n/x$,

and a possible choice is $g = x^n$.

We can now evaluate the $p_i$ recursively and obtain a polynomial
iteration of any order for the $n$th root. But in this case it is easy to
verify that

$a^{-1/n} = x(1 - f)^{-1/n} = x[1 + f/n + (n + 1)f^2/(2n^2) + \cdots]$.

On referring back to the argument used in 3.34 to derive Schröder's
iterations in general, we may conclude that the first $r$ terms in the
expansion in powers of $f$ must provide an iteration of order $r$, and this
is certainly a polynomial. Direct verification shows it to be the same as
is given by (3.35.4) and (3.35.5).

For the case $n = 1$, the iteration of order $r$ is

(3.35.6)    $\phi = x(1 + f + f^2 + \cdots + f^{r-1})$,

and for $r = 2$ this becomes on expanding $f$

$\phi = x(2 - ax)$.

For the case when $a$ and $x$ are matrices, this defines the Hotelling-Bodewig
iteration for inverting a matrix.

On introducing the subject of functional iteration, it was shown
that the iteration would converge to $\alpha$ when $x_0$ is any point in the complex
plane within a circle about $\alpha$ throughout which a certain Lipschitz
condition is satisfied. This condition is sufficient but by no means necessary.
In particular, suppose $\alpha$ is real and $\phi$ is a polynomial. The curve
$y = \phi$ intersects the line $y = x$ at $x = y = \alpha$, and if the iteration is of
order 2 or greater, then at this point the curve has a horizontal tangent.
Suppose the curve intersects the line also at some point $\alpha' < \alpha$ but
nowhere between $\alpha'$ and $\alpha$. Then if $\alpha' < x_0 < \alpha$, the sequence converges,
whereas the Lipschitz condition holds only in the vicinity of $\alpha$. Likewise
if the curve intersects the line at $\alpha'' > \alpha$ but nowhere between $\alpha$ and
$\alpha''$, then the sequence converges when $\alpha < x_0 < \alpha''$. For $x_0 < \alpha'$ and
for $x_0 > \alpha''$, the sequence either diverges or converges to some other
limit. Thus for the iteration $x(1 + f + f^2)$ to $a^{-1}$, the curve and line
intersect at $x = 0$, $a^{-1}$, and $2a^{-1}$, and the sequence converges to $a^{-1}$ if
and only if $0 < x_0 < 2a^{-1}$. For the iteration $x(1 + f)$, the curve intersects
the line only at $x = 0$ and $a^{-1}$, while $\phi(2a^{-1}) = 0$. This sequence
converges under the same conditions, as does any iteration (3.35.6).
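
The polynomial iterations (3.35.6) for the reciprocal, and the interval of convergence $0 < x_0 < 2a^{-1}$, can be illustrated by a short sketch. The following Python fragment is not part of the original text; the function name and the sample values are assumptions. Scalars are used here, although the same formula applies to matrices.

```python
def reciprocal_iteration(a, x, order, steps):
    """Iterate phi = x (1 + f + ... + f^(order-1)) with f = 1 - a x; no division is used."""
    for _ in range(steps):
        f = 1.0 - a * x
        total, power = 1.0, 1.0
        for _ in range(order - 1):
            power *= f
            total += power
        x = x * total
    return x

a = 3.0
inv_second = reciprocal_iteration(a, 0.1, 2, 12)   # phi = x (2 - a x)
inv_third = reciprocal_iteration(a, 0.1, 3, 8)     # phi = x (1 + f + f^2)
outside = reciprocal_iteration(a, 0.7, 2, 8)       # x_0 > 2/a: no convergence to 1/a
print(inv_second, inv_third, outside)
```

Since $f$ is replaced by $f^r$ at each step of the iteration of order $r$, the sequence converges exactly when $|1 - a x_0| < 1$, which is the interval stated above.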

3.4. Systems of Equations. All methods so far described have been
methods for dealing with a single equation in a single unknown. For
solving a system of equations one would like to follow the Gaussian procedure
in systems of linear equations, eliminating one variable from each
of $n - 1$ of the equations, another from each of $n - 2$ of these, and so on
until there is left a single equation in a single unknown. This of course
is quite impossible ordinarily. Nevertheless, it is possible to reduce the
problem to that of solving an infinite sequence of single equations. For
each equation of the sequence any of the methods described above can be
applied. The reduction to an infinite sequence can be accomplished by
the method of steepest descent.

On the other hand, one can attempt to reduce the problem to that 
of solving an infinite sequence of sets of linear equations. This type of 
reduction is accomplished by an appropriate generalization of the method 
of functional iteration. These two methods will now be described. Only 
the case of real functions of real variables will be considered in this 
section. 

3.41. The Method of Steepest Descent. The method of steepest descent
applies specifically to the location of the maximum or minimum of a
function of $n$ real variables. However if we have to solve a set of $n$
equations in $n$ variables

(3.41.1)    $f_i = 0$,

the function

(3.41.2)    $\phi = \sum f_i^2$

takes on the minimum value $\phi = 0$ at all points satisfying (3.41.1).
More generally if $a_{ij}$ are elements of a positive definite matrix, the
function

(3.41.3)    $\phi^* = \sum\sum a_{ij} f_i f_j$

also takes on the minimum value $\phi^* = 0$ at the same points. There are
thus many ways in which the problem of solving a set of equations can be
replaced by a problem in minimization. We therefore consider the problem
of minimizing the function $\phi(\xi_1, \xi_2, \ldots, \xi_n)$, or briefly $\phi(x)$, where
$x$ is the vector whose components are the $\xi_i$. The partial derivatives

$\phi_{\xi_i} = \partial\phi/\partial\xi_i$

are components of a vector which we shall designate as $\phi_x$ and take to
be a column vector. This is known as the gradient of $\phi$, often denoted by
grad $\phi$ or $\nabla\phi$, and its direction at any point $x$ is normal to that surface

(3.41.4)    $\phi = \text{const}$

which passes through the point $x$.

In the neighborhood of the point $x$ which minimizes $\phi$, the surfaces
(3.41.4) are closed if, as we suppose, $\phi$ is continuous, since if we take the
constant sufficiently close to the minimum value, then along any ray
through the point $x$ the function $\phi$ can only increase, and at some point
along the ray the function will first take on the value of the assigned
constant.

Now suppose that $x_0$ represents some initial estimate to the required $x$,
close enough to it so that the surface

(3.41.5)    $\phi(x) = \phi(x_0)$

is closed, and that the point $x$ lies in the region enclosed by the surface.
Then choose an arbitrary vector $u$, subject only to the requirement that
at $x_0$

(3.41.6)    $u^T \phi_x(x_0) \ne 0$.

This means that the direction $u$ is not tangent to the surface (3.41.5) at
$x_0$. It therefore cuts through the surface and so intersects surfaces at
which $\phi$ has smaller (as well as larger) values than at $x_0$. Determine

(3.41.7)    $x_1 = x_0 - \lambda u$

as that point on the line through $x_0$ in the direction of $u$ at which $\phi$ takes
on its least value. Thus we minimize the function

(3.41.8)    $\phi_1(\lambda) = \phi(x_0 - \lambda u)$

of the single variable $\lambda$. If the line in question should happen to pass
through the required point $x$, then we shall have solved our problem, but
in general this will not happen. However, $x_1$ will be on a surface

(3.41.9)    $\phi(x) = \phi(x_1)$

at which $\phi(x)$ has a value smaller than it has at $x_0$. The equation for
locating $x_1$ is obtained by equating to zero the derivative of $\phi_1$:

(3.41.10)    $u^T \phi_x(x_0 - \lambda u) = 0$.

This is not satisfied by $\lambda = 0$ because of condition (3.41.6). Equation
(3.41.10) states that at the point $x_1$ the line through $x_0$ in the direction $u$
is tangent to the surface (3.41.9), and this is geometrically evident. At
$x_1$, choose a new direction $u$, not tangent to the surface, and proceed
sequentially.

In this way is obtained a monotonically decreasing sequence $\phi(x_i)$
that is bounded below by the minimum value $\phi(x)$. The sequence therefore
has a limit at some point $x_\infty$. If $\phi_x(x_\infty) = 0$, then $x_\infty$ minimizes $\phi$,
and necessarily $x_\infty = x$ if $x_0$ is close enough to $x$ so that $\phi$ has a minimum
at one point only within the closed surface (3.41.5). But if $\phi_x(x_\infty) \ne 0$,
then we can find a direction $u$ not tangent to the surface through $x_\infty$
and so proceed farther. But then no limit would have been reached, and
hence the limit $x_\infty = x$.

In the method of steepest descent, strictly so-called, one chooses
always $u = \phi_x$. Since $\phi_x$ is the direction of most rapid variation of $\phi$,
this choice may be expected to lead to convergence after the fewest steps.
However, each step is fairly cumbersome. As with linear systems, it is
much simpler to take for each $u$ one of the reference vectors $e_i$, either in
rotation as in the Seidel process, or selected by some other criterion. It
is in keeping with the method of relaxation to examine the components of
$\phi_x$ and select the largest. If this is the $i$th, then one takes $u = e_i$, and
Eq. (3.41.10) reduces to

(3.41.11)    $\phi_{\xi_i}(x_0 - \lambda e_i) = 0$,

which amounts to solving the $i$th equation for the $i$th variable as a function
of current estimates of the other variables.

Equation (3.41.11) in $\lambda$ will ordinarily be nonlinear, and one of the
methods of successive approximation described above for an equation
in one variable will generally have to be applied. Newton's method, for
example, gives

(3.41.12)    $\lambda' = \phi_{\xi_i}(x_0)/\phi_{\xi_i \xi_i}(x_0)$

as the first approximation to $\lambda$. This will not necessarily reduce $\phi_{\xi_i}$ to
zero, but it should reduce the magnitude of the vector $\phi_x$. If so, it is
sufficient to take $x_1 = x_0 - \lambda' e_i$ and look for the largest component of $\phi_x$
at this point.

One should note carefully that the equations $\phi_{\xi_i} = 0$ being solved
explicitly by this method are not in general the same as Eqs. (3.41.1),
but they are satisfied by any solution $x$ of (3.41.1). The new equations
may be more complicated than the original. If Eqs. (3.41.1) themselves
arise from a minimizing (or a maximizing) problem, then they can be
used as they are, without forming the function $\phi$ by (3.41.2).
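
The method of steepest descent, strictly so-called, can be sketched for a pair of equations as follows. This Python fragment is illustrative and not part of the original text. The test system, the function names, the use of a golden-section search to minimize (3.41.8), the search interval $0 \le \lambda \le 1$, and the halving safeguard are all assumptions of the sketch.

```python
import math

def residuals(x, y):
    # Hypothetical test system f_1 = f_2 = 0, chosen only for illustration.
    return (2.0 * x + 0.5 * math.sin(y) - 1.0,
            2.0 * y + 0.5 * math.sin(x) - 2.0)

def phi(x, y):
    f1, f2 = residuals(x, y)
    return f1 * f1 + f2 * f2                       # the function (3.41.2)

def grad_phi(x, y):
    f1, f2 = residuals(x, y)
    # Gradient 2 J^T f, with Jacobian J = [[2, 0.5 cos y], [0.5 cos x, 2]].
    return (2.0 * (2.0 * f1 + 0.5 * math.cos(x) * f2),
            2.0 * (0.5 * math.cos(y) * f1 + 2.0 * f2))

def golden_min(g, a, b, iters=60):
    """Golden-section search for a minimizer of g on the interval [a, b]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    gc, gd = g(c), g(d)
    for _ in range(iters):
        if gc < gd:
            b, d, gd = d, c, gc
            c = b - r * (b - a)
            gc = g(c)
        else:
            a, c, gc = c, d, gd
            d = a + r * (b - a)
            gd = g(d)
    return 0.5 * (a + b)

def steepest_descent(x, y, steps=50):
    for _ in range(steps):
        ux, uy = grad_phi(x, y)                    # u = gradient of phi
        current = phi(x, y)
        line = lambda lam: phi(x - lam * ux, y - lam * uy)   # the function (3.41.8)
        lam = golden_min(line, 0.0, 1.0)
        while lam > 1e-16 and line(lam) >= current:          # insist on a decrease
            lam *= 0.5
        if lam <= 1e-16:
            break
        x, y = x - lam * ux, y - lam * uy
    return x, y

x_min, y_min = steepest_descent(0.0, 0.0)
print(x_min, y_min, phi(x_min, y_min))
```

Replacing the gradient by a single reference vector $e_i$ at each step would give the simpler relaxation-type procedure described above.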

3.42. Functional Iteration. Newton's method generalizes directly to
systems of equations. Consider first the general functional iteration
in $n$ variables. Let $g(x)$ stand for the vector whose elements are
$g_i(\xi_1, \xi_2, \ldots, \xi_n)$. Thus $g$ is a function of the vector $x$. Suppose for some
constant vector $\alpha$ it is true that

(3.42.1)    $g(\alpha) = \alpha$,

and consider the sequence defined by some $x_0$ and

(3.42.2)    $x_{i+1} = g(x_i)$.

Under what circumstances will this sequence of vectors converge to the
vector $\alpha$?

A sufficient condition for this can be given by means of a Lipschitz
condition. If, as in Chap. 2, we let $b(v)$ represent the magnitude of the
numerically greatest element in the vector $v$, then the sequence (3.42.2)
converges to the vector $\alpha$, provided that for some $k < 1$ and for some $\rho$
it is true that

(3.42.3)    $b[g(x') - g(x'')] < k\,b(x' - x'')$

for every $x'$ and $x''$ satisfying

$b(x' - \alpha) < \rho$, $b(x'' - \alpha) < \rho$,

if also $b(x_0 - \alpha) < \rho$. The proof is made exactly as in the one-dimensional
case if one uses the maximum absolute value whenever the absolute
value was used before.

Again, if for some $k < 1$ the same condition (3.42.3) holds whenever
$x'$ and $x''$ satisfy

$b(x' - x_0) < \rho$, $b(x'' - x_0) < \rho$,

and if also

$b[g(x_0) - x_0] < (1 - k)\rho$,

then we can conclude that the sequence (3.42.2) has a limit, which we may
call $\alpha$, that

$b(\alpha - x_0) < \rho$,

and that $\alpha$ satisfies (3.42.1). Hence under these circumstances we are
assured that a solution exists.

Now consider the system of $n$ equations in $n$ variables

(3.42.4)    $\phi_i(x) = 0$,

where $x$ represents the vector of elements $\xi_i$. If each function $\phi_i$ is
analytic in the neighborhood of some point $x_0$, then

(3.42.5)    $\phi_i(x) = \phi_i(x_0) + \sum_j (\xi_j - \xi_j^{(0)})\,\partial\phi_i(x_0)/\partial\xi_j + \cdots$,

where the $\xi_j^{(0)}$ are the elements of $x_0$.

If Eqs. (3.42.4) have a solution $x$ sufficiently close to $x_0$, that is, for which
$b(x - x_0)$ is sufficiently small, we might expect that by solving the
equations

$\phi_i(x_0) + \sum_j (\xi_j^{(1)} - \xi_j^{(0)})\,\partial\phi_i(x_0)/\partial\xi_j = 0$

for the quantities $\xi_j^{(1)}$ we would obtain an approximation to the $\xi_j$ that
is better than the $\xi_j^{(0)}$. If $f$ is the vector whose elements are the $\phi_i$, and
if we introduce the matrix

(3.42.6)    $f_x(x_0) \equiv [\partial\phi_i(x_0)/\partial\xi_j]$,

which is the Jacobian matrix of the functions $\phi_i$ evaluated at $x_0$, these
equations have the form

(3.42.7)    $f_x(x_0)x_1 = f_x(x_0)x_0 - f(x_0)$,

and if $f_x(x_0)$ is nonsingular, then they have the solution

(3.42.8)    $x_1 = x_0 - f_x^{-1}(x_0) f(x_0)$,

which is the direct generalization of the iteration given by Newton's
method.

The iteration does converge, and it is not necessary that derivatives
of all orders of the $\phi_i$ should exist. However if $n$ is at all large, the
repeated evaluation of the inverse matrix $f_x^{-1}$ or the repeated solution of
linear systems of the type (3.42.7) is certainly undesirable. Consequently
a somewhat more general theorem will be proved.

Suppose all functions $\phi_i$ have continuous first partial derivatives
in the region of $n$ space being considered, and suppose moreover that this
region is convex. If $\phi$ is any one of these functions, and $\phi_x$ is the row
vector of its first partial derivatives, then for any $x'$ and $x''$ in the region

$d\phi[x' + \theta(x'' - x')]/d\theta = \phi_x[x' + \theta(x'' - x')](x'' - x')$.

Hence

$\int_0^1 \phi_x[x' + \theta(x'' - x')](x'' - x')\,d\theta = \phi(x'') - \phi(x')$.

Written for all the functions, this identity becomes

(3.42.9)    $f(x'') = f(x') + \int_0^1 f_x[x' + \theta(x'' - x')](x'' - x')\,d\theta$.

For brevity write

(3.42.10)    $F(x', x'') \equiv \int_0^1 f_x[x' + \theta(x'' - x')]\,d\theta$.

Then

$F(x', x') \equiv f_x(x')$.

If $F(x_0, x_0)$ is nonsingular, then $F(x', x'')$ will remain nonsingular for $x'$
and $x''$ in a sufficiently small neighborhood of $x_0$. Define the vector

(3.42.11)    $g(x) \equiv x - F^{-1}(x_0, x_0) f(x)$.

Then

$f(x'') - f(x') = F(x', x'')(x'' - x')$

and

$g(x'') - g(x') = (x'' - x') - F^{-1}(x_0, x_0)[f(x'') - f(x')]$
$\qquad = (x'' - x') - F^{-1}(x_0, x_0)F(x', x'')(x'' - x')$
$\qquad = [I - F^{-1}(x_0, x_0)F(x', x'')](x'' - x')$.

For $x' = x'' = x_0$, the matrix within the brackets vanishes. For $\rho$ not
too large it will be true that for some $k < 1$

$b[I - F^{-1}(x_0, x_0)F(x', x'')] < k/n$,

when $b(x' - x_0) < \rho$ and $b(x'' - x_0) < \rho$. Then

$b[g(x'') - g(x')] < k\,b(x'' - x')$.

Also

$g(x_0) - x_0 = -F^{-1}(x_0, x_0) f(x_0)$.

Hence the iteration will converge if $x_0$ is close enough to the solution $x$
so that

$b[F^{-1}(x_0, x_0) f(x_0)] < (1 - k)\rho$.

This shows that, if the functions $\phi_i(x)$ have continuous first partial
derivatives in the neighborhood of a solution $x$, and if the Jacobian
$f_x$ is nonsingular in the neighborhood of this solution, then the iteration
(3.42.11) converges whenever $x_0$ is chosen sufficiently close to $x$. It is
true a fortiori that the Newtonian iteration defined by

(3.42.12)    $g(x) \equiv x - f_x^{-1}(x) f(x)$

also converges.
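
The difference between the Newtonian iteration (3.42.12) and an iteration in which the Jacobian is evaluated once for all at $x_0$ can be illustrated for two equations. The following Python sketch is not part of the original text; the test system, with a solution at $(1, 2)$, the starting point, and all function names are assumptions made for illustration.

```python
def f(v):
    x, y = v
    return (x * x + y - 3.0, x + y * y - 5.0)      # hypothetical system, solution (1, 2)

def jacobian(v):
    x, y = v
    return ((2.0 * x, 1.0), (1.0, 2.0 * y))

def solve2(a, r):
    """Solve the 2 x 2 linear system a d = r by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return ((r[0] * a22 - a12 * r[1]) / det, (a11 * r[1] - a21 * r[0]) / det)

def newton(v, steps):
    """Iteration (3.42.12): the Jacobian is re-evaluated at every step."""
    for _ in range(steps):
        d = solve2(jacobian(v), f(v))
        v = (v[0] - d[0], v[1] - d[1])
    return v

def frozen_jacobian(v, steps):
    """The Jacobian at the starting point is used throughout."""
    j0 = jacobian(v)
    for _ in range(steps):
        d = solve2(j0, f(v))
        v = (v[0] - d[0], v[1] - d[1])
    return v

start = (1.2, 1.8)
sol_newton = newton(start, 8)
sol_frozen = frozen_jacobian(start, 60)
print(sol_newton, sol_frozen)
```

The second routine converges only linearly, but each step needs no new matrix, which is the economy the more general theorem is meant to justify.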

Unfortunately, the practical question as to the convergence of the
computations modeled on this method is left undecided in general, and
round-off errors may make any degree of accuracy impossible. Even in
the case $n = 1$, the error in the calculation of $x_{i+1}$ from $x_i$ depends upon
the errors present in the computation of $f$ and of $f_x$, and the accuracy
with which these can be calculated depends entirely upon the nature of
the functions. If the errors in the calculation of $f$ and $f_x$ can be estimated,
then one can estimate the error in the quotient $f/f_x$ or in the solution.
This is the error in the correction $x_{i+1} - x_i$, and when it becomes
large by comparison with the computed correction, the further applications
of the iteration are of no value.

3.6. Complex Roots and Methods of Factorization. For a single vari- 
able, Newton's method and its generalizations apply equally to the 
determination of real and of complex roots. Also by appropriate applica- 
tion Graeffe's method and Bernoulli's method will yield complex roots. 
However, they are somewhat inconvenient, and a number of special 
methods have been devised for finding the real and imaginary parts of 
complex solutions. 

Any analytic function $f(z)$ of a complex variable $z = x + iy$ can always
be written in the form

(3.5.1)    $f(x + iy) = u(x, y) + i v(x, y),$

where $u$ and $v$ are real functions of the real variables $x$ and $y$. In
particular if $f$ is a polynomial, one can write Taylor's series

(3.5.2)    $f(x + iy) = f(x) + i y f'(x) - y^2 f''(x)/2! - i y^3 f'''(x)/3! + \cdots,$

whence

    $u(x, y) = f(x) - y^2 f''(x)/2! + y^4 f^{iv}(x)/4! - \cdots,$
    $v(x, y) = y f'(x) - y^3 f'''(x)/3! + \cdots,$

so that $u$ and $v/y$ are functions of $x$ and $y^2$.
Any solution $z$ of

(3.5.3)    $f(z) = 0$

must be in the form $z = x + iy$, where $x$ and $y$ are real and satisfy

(3.5.4)    $u(x, y) = v(x, y) = 0.$

These are real equations in the real variables $x$ and $y$, and can be treated
by either of the methods described in §3.4.

In the case of a polynomial $f$, it is possible to eliminate $y$, obtaining a
single equation in $x$ to be solved for real roots only. By substitution one
can obtain the equations in the associated $y$'s. For brevity write

    $u \equiv a_0 + a_2 y^2 + \cdots + a_{2m} y^{2m} = 0,$
    $y^{-1} v \equiv a_1 + a_3 y^2 + \cdots + a_{2m+1} y^{2m} = 0.$

Multiply the first equation by $a_1$, the second by $a_0$, subtract, and divide
through by $y^2$. The resulting equation is of degree $m - 1$ in $y^2$. If
$a_{2m+1} \ne 0$, multiply the first equation by $a_{2m+1}$, the second by $a_{2m}$, and
subtract. This gives a second equation of degree $m - 1$ in $y^2$, and these
two can be treated as were the original two. Eventually there results an
equation of degree 0 in $y^2$. If $a_{2m+1} = 0$, continue with the equation
resulting from the first elimination along with $y^{-1} v = 0$.

For applying Newton's method or one of its generalizations to the
original equation, one can also separate $\phi$ into its real and imaginary
parts, writing

(3.5.5)    $\phi(x + iy) = \psi(x, y) + i\omega(x, y).$

By a slight modification of the sequence $z_{i+1} = \phi(z_i)$, one can write



When $f(z)$ is a real polynomial, all complex roots occur in conjugate
pairs, and the roots of a conjugate pair satisfy a real quadratic equation.
Hence a real polynomial $f(z)$ can be completely factored into real
quadratic and linear factors. If the coefficients of a real quadratic factor of
$f(z)$ are known, it is then a simple matter to find its zeros, whether they
be real or complex.

Let

(3.5.6)    $d(z) \equiv z^2 + az + b.$

If $z_j$ satisfies

    $z_j^2 = -a z_j - b,$

i.e., if $z_j$ is a zero of $d(z)$, then $z_j$ satisfies also

    $z_j^3 = -a(-a z_j - b) - b z_j$
    $\phantom{z_j^3} = (a^2 - b) z_j + ab,$

and in general any positive integral power of $z_j$ is expressible as a linear
function of $z_j$ whose coefficients are polynomials in $a$ and $b$. Hence also
$f(z_j)$ is so expressible:

    $f(z_j) = r_1(a, b) z_j + r_0(a, b).$

If $d(z)$ is not a perfect square, it has a zero $z_k \ne z_j$ and

    $f(z_k) = r_1(a, b) z_k + r_0(a, b).$

If $z_j$ and $z_k$ are zeros also of $f(z)$ (whether complex or not), then

    $f(z_j) = f(z_k) = 0,$

and for $z_j$ and $z_k$ distinct this implies that

(3.5.7)    $r_1(a, b) = r_0(a, b) = 0.$

Here are two equations in the two unknowns $a$ and $b$ which must be
satisfied by the coefficients of a quadratic factor $d(z)$ of $f(z)$.

The polynomials $r_0$ and $r_1$ in $a$ and $b$ could be determined equally
well in a slightly different manner. Dropping the subscript on the $z$,
if $z$ is any zero of $d(z)$, we can say that

    $z^n = -a z^{n-1} - b z^{n-2}.$

Hence if $f(z)$ is of degree $n$, this substitution reduces it to a polynomial
of degree $n - 1$ in the zeros of $d(z)$, and the coefficients of the
polynomial are themselves polynomials in $a$ and $b$. Likewise

    $z^{n-1} = -a z^{n-2} - b z^{n-3},$

and by continuing sequentially in this way, $f(z)$ is again reduced to the
form

(3.5.8)    $f(z) = r_1(a, b) z + r_0(a, b),$

valid for any $z$ satisfying $d(z) = 0$. But now one can see easily that
this is precisely the process for forming the remainder after division of
$f(z)$ by $d(z)$. Hence for all $z$ it is true that

(3.5.9)    $f(z) \equiv (z^2 + az + b) Q_1(z) + z r_1(a, b) + r_0(a, b),$

where $Q_1(z)$ is the quotient. Hence if $a$ and $b$ satisfy (3.5.7), the division
is exact. Hence the conditions (3.5.7) are sufficient, as well as
necessary, for $d(z)$ to divide $f(z)$.

For the solution of Eqs. (3.5.7) Hitchcock gives an iteration which is,
in fact, an application of Newton's method. If one now divides $Q_1(z)$
by $d(z)$, then $f(z)$ can be written

(3.5.10)    $f(z) \equiv (z^2 + az + b)^2 Q(z) + (z^2 + az + b) q(z; a, b) + r(z; a, b),$

where

(3.5.11)    $q(z; a, b) \equiv z q_1(a, b) + q_0(a, b),$
            $r(z; a, b) \equiv z r_1(a, b) + r_0(a, b),$

and $q(z; a, b)$ is the remainder after the second division.

Let $z$ represent either zero of $d(z)$. Then from differentiating (3.5.10)
with respect to $a$ and to $b$, since $f$ is itself independent of $a$ and $b$, it
follows that

    $0 = zq + \partial r/\partial a,$
    $0 = q + \partial r/\partial b.$

In detail these equations are

    $z^2 q_1 + z q_0 + z\,\partial r_1/\partial a + \partial r_0/\partial a = 0,$
    $z q_1 + q_0 + z\,\partial r_1/\partial b + \partial r_0/\partial b = 0.$

These equations must hold for any zero $z$ of $d(z)$. Hence, since
$z^2 = -az - b$, the first can be written

    $z(\partial r_1/\partial a + q_0 - a q_1) + (\partial r_0/\partial a - b q_1) = 0,$

and the second

    $z(\partial r_1/\partial b + q_1) + (\partial r_0/\partial b + q_0) = 0.$

If $d(z)$ is not a perfect square, then the coefficients of $z$ and the terms free
of $z$ must vanish separately. Hence

(3.5.12)    $\partial r_1/\partial a = a q_1 - q_0, \quad \partial r_0/\partial a = b q_1,$
            $\partial r_1/\partial b = -q_1, \quad \partial r_0/\partial b = -q_0.$

These are the partial derivatives required for the application of Newton's
method to the solution of (3.5.7), and they are obtained by two divisions
of $f(z)$ by $(z^2 + az + b)$.

Hence if $a_\alpha$ and $b_\alpha$ represent approximations to the coefficients $a$ and $b$
of an exact divisor $d(z)$, then improved approximations $a_{\alpha+1}$, $b_{\alpha+1}$ can
be obtained by solving

    $(a_\alpha q_1 - q_0)(a_{\alpha+1} - a_\alpha) - q_1 (b_{\alpha+1} - b_\alpha) = -r_1,$
    $b_\alpha q_1 (a_{\alpha+1} - a_\alpha) - q_0 (b_{\alpha+1} - b_\alpha) = -r_0,$

where $q_0$, $q_1$, $r_0$, $r_1$ are the quantities obtained after division of $f(z)$ twice
by $z^2 + a_\alpha z + b_\alpha$. When the process is carried out in this way, the
general forms of the polynomials $r_0$ and $r_1$ in $a$ and $b$ are not obtained.
Instead their numerical values and those of their partial derivatives are
obtained by the divisions which use the current numerical approximations
$a_\alpha$ and $b_\alpha$.
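
The two divisions and the Newton correction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `divide_by_quadratic` and `hitchcock_step` are invented here, polynomial coefficients are stored highest power first, and the test polynomial $(z^2 + 2z + 5)(z^2 - 3z + 2)$ is chosen for the example. The linear system is the one formed from the partial derivatives (3.5.12), solved by Cramer's rule.

```python
def divide_by_quadratic(coeffs, a, b):
    """Divide a polynomial (coefficients highest power first) by z^2 + a z + b.

    Returns (quotient, r1, r0), where the remainder is r1*z + r0.
    """
    # Pad very short inputs so that a remainder r1*z + r0 always exists.
    work = [0.0] * max(0, 2 - len(coeffs)) + [float(c) for c in coeffs]
    n = len(work) - 1
    quotient = []
    for k in range(n - 1):
        lead = work[k]
        quotient.append(lead)
        work[k + 1] -= a * lead
        work[k + 2] -= b * lead
    return quotient, work[n - 1], work[n]


def hitchcock_step(coeffs, a, b):
    """One Newton correction of (a, b) using the derivatives (3.5.12)."""
    quotient, r1, r0 = divide_by_quadratic(coeffs, a, b)   # first division
    _, q1, q0 = divide_by_quadratic(quotient, a, b)        # second division
    # Jacobian of (r1, r0) with respect to (a, b).
    j11, j12 = a * q1 - q0, -q1
    j21, j22 = b * q1, -q0
    det = j11 * j22 - j12 * j21
    # Solve J * (da, db) = (-r1, -r0) by Cramer's rule.
    da = (-r1 * j22 + r0 * j12) / det
    db = (-r0 * j11 + r1 * j21) / det
    return a + da, b + db


# Illustrative polynomial: (z^2 + 2z + 5)(z^2 - 3z + 2).
f = [1.0, -1.0, 1.0, -11.0, 10.0]
a, b = 1.9, 4.8                      # rough guess near the factor z^2 + 2z + 5
for _ in range(20):
    a, b = hitchcock_step(f, a, b)
```

Starting from the rough guess the pair `(a, b)` settles on the coefficients of the factor $z^2 + 2z + 5$; as with any Newton iteration, a poor starting guess may fail to converge.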

This method can be generalized to yield a factor of arbitrary degree.
If one writes down formally a factorization of $f(z)$ into factors with
unknown coefficients, then by expressing that $f(z)$ is to equal identically
the product of these factors one obtains a set of equations relating the
unknown coefficients. Let the unknown coefficients be represented as
$a_1, a_2, \ldots, a_N$ taken in any order, and let the conditions be written

    $\psi_1 = \psi_2 = \cdots = \psi_N = 0,$

where each $\psi_i$ is a polynomial in the $a$'s. If with each $\psi_i$ one can
associate an $a_j$ in such a way that $\psi_i = 0$ is easily solved for $a_j$ as a function
of the other $a$'s, one can use this fact to define formally an iterative
scheme for evaluating the coefficients in the factorization, and many
different such schemes have been proposed. Generally, however, the
question of convergence is left open.

3.6. Bibliographic Notes. The author is indebted to Professor 
Schwerdtfeger for numerous references on iterative methods for both 
linear and nonlinear equations, as well as for a copy of some lecture notes. 
And at this point reference may be made to Blaskett and Schwerdtfeger
(1945) on the Schröder iterations.

König (1884) published the theorem that is now classic in the theory
of functions. Hadamard (1892) elaborated this and related notions, with
reference to the location and characterization of singular points of analytic
functions, and made application to the evaluation of zeros. A more
recent discussion is that of Golomb (1943). Aitken (1926, 1931, 1936-
1937b) discusses extensively the use of Bernoulli's method and the $\delta^2$
process. Whittaker and Robinson (1940, and other editions) discuss
Bernoulli's method and Whittaker's method.

The general convergence theorem for functional iteration is given by 
Hildebrandt and Graves (1927; see also Graves, 1946). Numerous 
special methods for solving equations of all types, like Picard's method 
for differential equations, are methods of functional iteration. Hamilton 
(1946) gives a number of special convergence theorems and (1950) gives 
an analytic derivation of the algorithm derived geometrically by Rich- 
mond (1944) and previously obtained by Critchfield and Beck (1935). 
Schröder (1870), Runge (1885), and others had given similar algorithms
for algebraic equations.

If one defines the functions $\phi_i$ by $x_i = \phi(x_{i-1}) = \phi_j(x_{i-j})$, these
functions are of a class called "permutable" or "commutative," on which
there is an extensive literature. If $x$ is a real or complex number, or
a point in a general space, and satisfies $x = \phi(x)$, then $x$ is said to be a
"fixed point" of the transformation $\phi$, and comes into consideration in
the topological literature.

Collatz (1950), Wenzl (1952), and others have described iterations 
converging to a power of a root. On questions relating to errors and 
rates of convergence, see Ostrowski (1936, 1937, 1938) and Bodewig 
(1949). 

The iteration for quadratic factors of a polynomial was given by 
Hitchcock (1938, 1939, 1944); generalizations are given by Lin (1941, 
1943), Luke and Ufford (1951), Friedman (1949), where the factors are of 
arbitrary degree. There are many ways by which to define an iteration, 
but the treatment of convergence becomes more difficult when both 
factors are higher than quadratic. 

As it applies to algebraic equations, Graeffe's method is frequently 
treated both in the textbooks and in the periodicals, and several references 
are given in the bibliography. The differential technique was given by 
Brodetsky and Smeal (1924) and was applied to transcendental equations 
by Lehmer (1945). For an elegant treatment of convergence in the
general case see Pólya (1915), and for a treatment of error see Ostrowski
(1940).

Most of the basic principles in the theory of equations cited here 
are to be found in standard textbooks, but for properties of the Vander- 
monde determinants and identities relating the several types of sym- 
metric functions see Muir's "Theory." 



CHAPTER 4 
THE PROPER VALUES AND VECTORS OF A MATRIX 

4. The Proper Values and Vectors of a Matrix 

The characteristic function of a matrix $A$ was defined in Chap. 2 as the
polynomial

(4.0.1)    $\phi(\lambda) = |A - \lambda I| = (-1)^n (\lambda^n + a_1 \lambda^{n-1} + \cdots + a_n)$

obtained by expanding the determinant of the matrix $A - \lambda I$. The
characteristic equation is the equation $\phi = 0$, and its roots are the proper
values of the matrix. For any proper value $\lambda$, the matrix $A - \lambda I$ is
singular, whence the equation

(4.0.2)    $Ax = \lambda x$

has at least one nontrivial solution $x$, and any solution is a proper vector
associated with the proper value $\lambda$. Of fundamental importance to the
study of proper values and vectors is the Cayley-Hamilton theorem,
which states that

(4.0.3)    $\phi(A) = 0$

identically. In words, any matrix satisfies its own characteristic
equation. In special cases a matrix $A$ may satisfy an equation of lower degree,
say

(4.0.4)    $\psi(\lambda) = 0.$

Equation (4.0.4) of the lowest degree satisfied by $A$ is called the minimal
equation, and the polynomial $\psi(\lambda)$, the minimal function. The minimal
function $\psi(\lambda)$ divides $\phi(\lambda)$, and on the other hand every proper value
is a root of the minimal equation.
To show that $\psi(\lambda)$ divides $\phi(\lambda)$, let

    $\phi(\lambda) = q(\lambda)\psi(\lambda) + r(\lambda),$

where $q(\lambda)$ and $r(\lambda)$ are polynomials, and $r(\lambda)$ is of lower degree than
$\psi(\lambda)$. Since $\phi(A) = 0$, and $\psi(A) = 0$, it follows that $r(A) = 0$, and this
can be true only if $r(\lambda) \equiv 0$. The same argument can be used to show
that $\psi(\lambda)$ divides any polynomial $\omega(\lambda)$ for which $\omega(A) = 0$.
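
As a small numerical illustration of the Cayley-Hamilton theorem and of a minimal function of lower degree, the following Python sketch treats two 2-by-2 matrices chosen for this example. The helper names are hypothetical, and the explicit form $\phi(\lambda) = \lambda^2 - \operatorname{tr}(A)\lambda + |A|$ is the standard expansion for order 2.

```python
def matmul(x, y):
    """Product of two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly_at_matrix(a):
    """Evaluate phi(A) = A^2 - tr(A) A + det(A) I for a 2-by-2 matrix A."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    a2 = matmul(a, a)
    return [[a2[i][j] - trace * a[i][j] + det * (i == j)
             for j in range(2)] for i in range(2)]


general = [[2, 1], [-3, 4]]
scalar = [[2, 0], [0, 2]]

# Cayley-Hamilton: phi(A) is the zero matrix.
phi_general = char_poly_at_matrix(general)

# For the scalar matrix 2I the linear polynomial psi(lambda) = lambda - 2
# already annihilates A, so its minimal function has degree 1, not 2.
psi_scalar = [[scalar[i][j] - 2 * (i == j) for j in range(2)]
              for i in range(2)]
```

Both `phi_general` and `psi_scalar` come out as zero matrices, the first by the theorem and the second because $\lambda - 2$ divides $\phi(\lambda) = (\lambda - 2)^2$.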

Before showing that all proper values satisfy (4.0.4), we show that, if
$h(\lambda)$ is the highest common divisor of the elements of $\operatorname{adj}(A - \lambda I)$, then

(4.0.5)    $\phi(\lambda) = h(\lambda)\psi(\lambda).$

We know that

(4.0.6)    $(A - \lambda I)\operatorname{adj}(A - \lambda I) = \phi(\lambda) I.$

If $P(\lambda)$ is defined by

    $\operatorname{adj}(A - \lambda I) = h(\lambda) P(\lambda),$

then $P(\lambda)$ is a matrix whose elements are polynomials in $\lambda$, and these
polynomials in $\lambda$ have no nonconstant common divisor. Now (4.0.6)
becomes

    $h(\lambda)(A - \lambda I) P(\lambda) = \phi(\lambda) I.$

But $A - \lambda I$ and $P(\lambda)$ are matrices whose elements are polynomials in $\lambda$.
Hence

(4.0.7)    $(A - \lambda I) P(\lambda) = m(\lambda) I,$

where $m(\lambda) = \phi(\lambda)/h(\lambda)$ is a polynomial.

When $A$ replaces $\lambda$ in (4.0.7), the left member vanishes. Hence
$m(A) = 0$, and therefore $\psi(\lambda)$ divides $m(\lambda)$. It remains to prove that
$m(\lambda)$ divides $\psi(\lambda)$. We can find a polynomial matrix $Q(\lambda)$ and a constant
matrix $Q_0$ such that

    $\psi(\lambda) I \equiv (A - \lambda I) Q(\lambda) + Q_0$

identically by expanding and equating coefficients of like powers of $\lambda$.
But since $\psi(A) = 0$, it follows, on replacing $\lambda$ by $A$, that $Q_0 = 0$. Hence

    $\psi(\lambda) I \equiv (A - \lambda I) Q(\lambda).$

Since $\psi(\lambda)$ divides $m(\lambda)$, we can write

    $m(\lambda) = k(\lambda)\psi(\lambda),$

whence

    $m(\lambda) I = k(\lambda)\psi(\lambda) I = k(\lambda)(A - \lambda I) Q(\lambda).$

By comparing this with (4.0.7) we conclude that

    $k(\lambda) Q(\lambda) = P(\lambda).$

Hence every element of $P(\lambda)$ is divisible by the polynomial $k(\lambda)$. Hence
$k(\lambda)$ is a constant, and therefore $m(\lambda)$ and $\psi(\lambda)$ differ at most by a
constant multiplier. If we now take the determinants of the two sides of
(4.0.7), we have

    $\phi(\lambda)|P(\lambda)| = [m(\lambda)]^n,$

whence every linear factor of $\phi(\lambda)$ must be also a factor of $m(\lambda)$, and
hence of $\psi(\lambda)$.

If $x$ is any vector, then since $\psi(A) = 0$, it is certainly true that

    $\psi(A) x = 0.$

For a given vector $x \ne 0$ let $h(\lambda)$ be a polynomial for which it is also
true that $h(A)x = 0$. Then if $d(\lambda)$ is the highest common divisor of
$h(\lambda)$ and $\psi(\lambda)$, it is true that $d(A)x = 0$. In fact, one can find
polynomials $p(\lambda)$ and $q(\lambda)$ such that

    $p(\lambda)\psi(\lambda) + q(\lambda)h(\lambda) = d(\lambda),$

whence, since $\psi(A) = 0$,

    $q(A)h(A) = d(A),$

and the conclusion follows on multiplying this identity into $x$. There is
therefore a polynomial of lowest degree $h(\lambda)$ for which $h(A)x = 0$. This
must divide $\psi(\lambda)$, since otherwise the highest common divisor $d(\lambda)$ is of
still lower degree than $h(\lambda)$, contrary to the hypothesis.
If $\lambda_i$ is any proper value, then

    $\psi_i(\lambda) = \psi(\lambda)/(\lambda - \lambda_i)$

is a polynomial, since, as was shown above, every proper value satisfies
$\psi = 0$. Since $\psi$ is the polynomial of lowest degree for which $\psi(A) = 0$,
it follows that $\psi_i(A) \ne 0$, whence for some $x$ it is true that $\psi_i(A)x \ne 0$.
But

    $\psi(A) = (A - \lambda_i I)\psi_i(A),$
    $(A - \lambda_i I)\psi_i(A)x = 0,$

and therefore $\psi_i(A)x$ is a proper vector corresponding to the proper value
$\lambda_i$. That is to say, any non-null linear combination of the columns of
$\psi_i(A)$ is a proper vector.

Let $\lambda_i$ be a $\nu_i$-fold root of $\psi = 0$, and now let

(4.0.8)    $\psi_i(\lambda) = \psi(\lambda)/(\lambda - \lambda_i)^{\nu_i}.$

Then $(\lambda - \lambda_i)^{\nu_i}$ and $\psi_i(\lambda)$ are relatively prime, whence for some
polynomials $p(\lambda)$ and $q(\lambda)$

    $p(\lambda)(\lambda - \lambda_i)^{\nu_i} + q(\lambda)\psi_i(\lambda) \equiv 1,$

identically. Hence

    $p(A)(A - \lambda_i I)^{\nu_i} + q(A)\psi_i(A) = I.$

Let $y \ne 0$ be a proper vector corresponding to $\lambda_i$. Then

(4.0.9)    $\psi_i(A) q(A) y = y.$

Hence $y$ is a linear combination of the columns of $\psi_i(A)$. If $z$ is any
linear combination of these columns, then the last nonvanishing vector
in the sequence

             $z_0 = z,$
(4.0.10)     $z_1 = (A - \lambda_i I) z_0,$
             $z_2 = (A - \lambda_i I) z_1,$
             $\cdots$

is a proper vector associated with $\lambda_i$, and all vectors of the sequence are
principal vectors.

Among the schemes for finding the proper values of a matrix, some lead
directly to the characteristic function $\phi$, to the minimal function $\psi$, or to
some divisor $\omega(\lambda)$ of the minimal function. When this function is
equated to zero, the resulting equation is then to be solved by any
convenient method. The scheme for finding the polynomial $\phi$, $\psi$, or $\omega$, as
the case may be, may or may not have associated with it a scheme for
finding the proper vectors. If the scheme provides only some $\omega$, and
not necessarily $\psi$ or $\phi$, it may be necessary to reapply the scheme in order
to obtain the remaining proper values.

Other schemes are iterative in character, depending upon the repeated 
multiplication of a matrix by a vector. A scheme of this type ordinarily 
leads to a sequence of vectors having a proper vector as its limit and to a 
sequence of scalars whose limit is the associated proper value. Before 
describing these methods in detail, we shall introduce a few further 
preliminaries. 

4.01. Bounds for the Proper Values of a Matrix. Since a nonsymmetric
matrix may have complex proper values, and hence complex proper
vectors, it is necessary to give further consideration to complex matrices.
The natural generalization of a symmetric real matrix is a Hermitian
complex matrix. The matrix $A$ is Hermitian in case it is equal to its own
conjugate transpose, i.e., to the matrix obtained when every element is
replaced by its complex conjugate, and the resulting matrix is then
transposed. Let a bar represent the conjugate (as is customary), and an
asterisk represent the conjugate transpose. Then the matrix $A$ is
Hermitian in case

(4.01.1)    $A^* = \bar{A}^T = A.$

If $A$ is Hermitian, and $x$ is any vector, real or complex, then $x^*Ax$ is a
real number. For if we take its complex conjugate, we have $x^T \bar{A} \bar{x}$; but
this is a scalar and is equal to its own transpose $\bar{x}^T \bar{A}^T x = x^* A^* x^{**}$.
However $x^{**} = x$ and $A^* = A$, and the theorem is proved. Hence we can define a
positive definite Hermitian matrix as a Hermitian matrix for which
$x^*Ax > 0$ whenever $x \ne 0$, and a non-negative semidefinite matrix as
one for which $x^*Ax \ge 0$ for every $x$. Only a singular matrix can be
semidefinite without being definite. Clearly a Hermitian matrix all of
whose elements are real is a symmetric matrix.

Analogous to a real orthogonal matrix, i.e., a matrix $V$ such that
$V^T V = I = V V^T$, is a unitary matrix $U$, which is one such that

(4.01.2)    $U^* U = I = U U^*.$

A unitary matrix with real elements is orthogonal. 

The proper values of a Hermitian matrix are all real, since if $Ax = \lambda x$,
then $x^*Ax = \lambda x^*x$, and both $x^*Ax$ and $x^*x$ are real numbers. Also, if
complex vectors $x$ and $y$ are said to be orthogonal when $x^*y = y^*x = 0$,
then proper vectors associated with distinct proper values of a Hermitian
matrix are orthogonal. For if

    $Ax = \lambda x, \quad Ay = \mu y,$

then

    $y^*Ax = \lambda y^*x, \quad x^*Ay = \mu x^*y.$

But

    $\overline{y^*Ax} = x^*Ay, \quad \overline{y^*x} = x^*y,$

whence if $\lambda \ne \mu$, this implies that $x^*y = 0$.

If $A$ is Hermitian, there exists a unitary matrix $U$ such that

(4.01.3)    $U^* A U = \Lambda,$

where $\Lambda$ is a diagonal matrix whose elements are the proper values of $A$,
and where the columns of $U$ are the proper vectors of $A$. This
corresponds to the case of the real symmetric matrix, and the argument can
be made by paraphrasing that given in the real case.

If $A$ is any matrix, Hermitian or not, any scalar of the form $x^*Ax/x^*x$
for $x \ne 0$ is said to lie in the field of values of $A$. Any proper value of $A$
lies in its field of values. For if $Ax = \lambda x$, then $x^*Ax = \lambda x^*x$.

If $A$ is any matrix, then $A^*A$ is Hermitian and semidefinite; it is
also positive definite if $A$ is nonsingular, for then $Ax \ne 0$ whenever $x \ne 0$,
and hence $x^*A^*Ax = (Ax)^*(Ax) > 0$. If the proper values of $A^*A$ are
$\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_n^2 \ge 0$, and $\lambda$ is any proper value of $A$, then

(4.01.4)    $\rho_1^2 \ge \lambda\bar{\lambda} \ge \rho_n^2.$

For if $Ax = \lambda x$, then $x^*A^* = \bar{\lambda} x^*$, and hence

    $x^* A^* A x = \bar{\lambda}\lambda\, x^* x.$

Hence $\lambda\bar{\lambda}$ is in the field of values of $A^*A$. But if $\alpha$ is any number in
the field of values of $A^*A$, then for some $a$ with $a^*a = 1$ it is true that
$a^*A^*Aa = \alpha$. If

    $U^* A^* A U = P^2,$

where $P^2$ is the diagonal matrix whose elements are the $\rho_i^2$, and if $a = Ub$,
then

    $\alpha = b^* U^* A^* A U b = \sum \bar{\beta}_i \beta_i \rho_i^2;$

all products $\bar{\beta}_i \beta_i$ are real and non-negative, and their sum is $b^*b = a^*a = 1$;
hence $\alpha$ is a weighted mean of the $\rho_i^2$. Hence $\alpha$ cannot exceed the
greatest nor be exceeded by the least of the $\rho_i^2$:

    $\rho_1^2 \ge \alpha \ge \rho_n^2.$

Since $\lambda\bar{\lambda}$ is such an $\alpha$, the relation (4.01.4) now follows.
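
The bound (4.01.4) can be checked on a small case. The sketch below is only an illustration: the real 2-by-2 matrix is chosen for the example, the helper names are hypothetical, and the proper values of both $A$ and $A^*A$ are found from the quadratic formula, which is possible only because the order is 2.

```python
import cmath


def eig2(m):
    """Both proper values of a 2-by-2 matrix via the quadratic formula."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def gram(m):
    """A*A for a real 2-by-2 matrix (A* is then simply the transpose)."""
    return [[sum(m[k][i] * m[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a = [[1.0, 2.0], [-3.0, 1.0]]          # proper values 1 +/- i*sqrt(6)
lams = eig2(a)
rho_sq = sorted(ev.real for ev in eig2(gram(a)))   # [rho_n^2, rho_1^2]
moduli_sq = [abs(lam) ** 2 for lam in lams]        # lambda * conj(lambda)
```

Here both proper values have squared modulus 7, which lies between the two proper values $(15 \mp \sqrt{29})/2$ of $A^*A$.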

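If $\lambda$ is a proper value of $A$ with multiplicity $\nu$, then $\lambda + \mu$ is a proper
value of $A + \mu I$ with multiplicity $\nu$. For this reason, the following
classical theorem can provide information as to the limits of the proper
values of a matrix: If for every $i$

(4.01.5)    $2|\alpha_{ii}| > \sum_{j=1}^{n} |\alpha_{ij}|,$

then the matrix $A$ is nonsingular.

If the matrix were singular, then the equation $Ax = 0$ would have a
nontrivial solution. Among the elements of $x$, let $\xi_i$ be an element of
greatest modulus, $|\xi_i| \ge |\xi_j|$ for all $j$. Then

    $\alpha_{ii}\xi_i = -\sum_{j \ne i} \alpha_{ij}\xi_j.$

But

    $\left|\sum_{j \ne i} \alpha_{ij}\xi_j\right| \le \sum_{j \ne i} |\alpha_{ij}||\xi_j| \le |\xi_i| \sum_{j \ne i} |\alpha_{ij}|,$

whence

    $|\alpha_{ii}| \le \sum_{j \ne i} |\alpha_{ij}|.$

Hence we have a contradiction, and the theorem is proved.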

In some cases the theorem remains valid even when some, but not all,
of the inequalities in (4.01.5) become equalities. Suppose, for example,
that

(4.01.6)    $|\alpha_{11}| > \sum_{j=2}^{n} |\alpha_{1j}|.$

Then if $\xi_i$ has the same significance as before, there is at least one $\xi_j$ for
which $|\xi_j| < |\xi_i|$. For

    $\alpha_{11}\xi_1 = -\sum_{j=2}^{n} \alpha_{1j}\xi_j,$

whence on applying the hypothesis it is clear that, for some $j$, $|\xi_j| > |\xi_1|$,
and the $\xi$'s are therefore not all equal in modulus. Now

    $|\alpha_{ii}||\xi_i| \le \sum_{k \ne i} |\alpha_{ik}||\xi_k|,$

but this is inconsistent with (4.01.6) unless $\alpha_{ik} = 0$ for every $k$ such that
$|\xi_k| < |\xi_i|$. If this is so, and if there are $\nu$ values of $j$ such that $|\xi_j| = |\xi_i|$,
then there are $n - \nu$ values of $k$ for which $\alpha_{ik} = 0$. But also by the
same argument $\alpha_{jk} = 0$ for each such $k$ and every $j$ for which $|\xi_j| = |\xi_i|$.
By performing a suitable permutation on the rows of $A$ and the same
permutation on the columns, we can assume that

    $\alpha_{jk} = 0, \quad j = n - \nu + 1, n - \nu + 2, \ldots, n; \quad k = 1, 2, \ldots, n - \nu.$

The matrix is then in the form

(4.01.7)    $\begin{pmatrix} P & Q \\ 0 & R \end{pmatrix},$

where $P$ and $R$ are square matrices. Hence if the matrix is not one which
can be given the form (4.01.7) by any permutation of rows, accompanied
by the same permutation of columns, then the conditions

(4.01.8)    $2|\alpha_{ii}| \ge \sum_{j=1}^{n} |\alpha_{ij}|$

with a proper inequality for at least one value of $i$ are sufficient to ensure
the nonsingularity of the matrix.

Obviously the above argument can be applied to $A^T$.

Now let

(4.01.9)    $P_i = \sum_{j \ne i} |\alpha_{ij}|.$

If we apply the above results to the matrix $A - \lambda I$, it is clear that, if $\lambda$ is
a proper value, $A - \lambda I$ becoming singular, then either

(4.01.10)    $|\lambda - \alpha_{ii}| = P_i$

for every $i$, or else for some $i$ it is true that $|\lambda - \alpha_{ii}| < P_i$. In either
event it is true that the proper value lies within or on the boundary of at
least one of the $n$ circles in the complex plane defined by (4.01.10). On
applying the same argument to $A^T$, we conclude that every proper value
lies within or on the boundary of at least one of the $n$ circles defined by

(4.01.11)    $|\lambda - \alpha_{ii}| = \sum_{j \ne i} |\alpha_{ji}|.$
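
The circles (4.01.10) and (4.01.11) are easily tabulated. The following Python sketch is an illustration only; the function names and the 2-by-2 example (whose proper values 5 and 2 follow from its trace 7 and determinant 10) are assumptions made for this example.

```python
def gershgorin_discs(a):
    """Return (center, radius) per row: center alpha_ii, radius P_i of (4.01.9)."""
    n = len(a)
    return [(a[i][i], sum(abs(a[i][j]) for j in range(n) if j != i))
            for i in range(n)]


def in_some_disc(lam, discs, tol=1e-12):
    """True if lam lies within or on the boundary of at least one circle."""
    return any(abs(lam - center) <= radius + tol for center, radius in discs)


a = [[4.0, 1.0], [2.0, 3.0]]            # proper values 5 and 2
a_t = [list(col) for col in zip(*a)]    # transpose of a

row_discs = gershgorin_discs(a)         # circles (4.01.10)
col_discs = gershgorin_discs(a_t)       # circles (4.01.11)
covered = all(in_some_disc(lam, row_discs) and in_some_disc(lam, col_discs)
              for lam in (5.0, 2.0))
```

In this example the proper value 5 lies on the boundary of the first row circle, so the closed circles, boundaries included, are needed.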

4.1. Iterative Methods. These methods provide for the direct
manipulation of the matrix itself or of some matrix simply related to it, without
necessitating the explicit development of the characteristic or other
polynomial. We begin with the relatively simple case of a Hermitian
matrix.

4.11. The Matrix A Is Hermitian. If the proper vectors $u_i$ of a
Hermitian matrix were all known, these could be normalized to unit
length $u_i^* u_i = 1$, and they would form the columns of the unitary matrix
$U$ such that

(4.11.1)    $U^* A U = \Lambda, \quad U^* U = I,$

where $\Lambda$ is the diagonal matrix of proper values. We may assume the
$u_i$ to be so ordered that

(4.11.2)    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n.$

If $x$ is any vector and $y$ satisfies

    $x = Uy, \quad y = U^* x,$

then

    $x^* A x = y^* \Lambda y, \quad x^* x = y^* U^* U y = y^* y.$

Hence the field of values of $A$ and that of $\Lambda$ are identical, and if $\alpha$ is in
this field of values, then

    $\lambda_1 \ge \alpha \ge \lambda_n.$

If $p(\lambda)$ is any polynomial in $\lambda$, the matrix $p(A)$ has the same proper
vectors as $A$ itself, and its proper values are $p(\lambda_i)$. In particular, the
matrix $A^2$ is necessarily non-negative, definite or semidefinite.
Moreover, for $\mu$ sufficiently large, $A + \mu I$ is positive definite. It is therefore
no essential restriction to assume, when convenient, that $A$ is positive
definite or is at least semidefinite.

In §2.06 the trace $\operatorname{tr}(A)$, which is the sum of the diagonal elements,
was shown to be equal to the coefficient of $\lambda^{n-1}$ in the characteristic
equation, except possibly for the sign, and to be equal to the sum of the proper
values,

    $\operatorname{tr}(A) = \sum \lambda_i.$

More generally, if $p(\lambda)$ is any polynomial, then from (4.11.1) it follows
that

(4.11.3)    $\operatorname{tr}[p(A)] = \sum p(\lambda_i).$

Moreover, if $\nu$ is any integer, and $A$ is nonsingular,

(4.11.4)    $\operatorname{tr}(A^{-\nu}) = \sum \lambda_i^{-\nu}.$

The norm $N(A)$ of a real matrix $A$, symmetric or not, was defined to be
the square root of the sum of the squares of the elements. Equivalently,
this is the square root of the trace of $A^T A$. For a complex matrix,
Hermitian or not, it can be defined by

(4.11.5)    $N^2(A) = \operatorname{tr}(A^* A).$

As so defined, $\operatorname{tr}(A^* A)$ is a positive real number, and for $N(A)$ the
positive root is to be taken. Hence if $A$ is Hermitian,

    $N^2(A) = \operatorname{tr}(A^2) = \operatorname{tr}(\Lambda^2) = \sum \lambda_i^2.$

Now $N^2(A)$ is the sum of the squares of the moduli of all elements of $A$.
Obviously this is equal to the sum of the squares of the diagonal elements
of $A$ if and only if $A$ is a diagonal matrix. Hence

    $\sum \alpha_{ii}^2 \le N^2(A),$

and the equality holds only when $A$ is diagonal. Now the unitary
transform $V^*AV$ of $A$, where $V$ is any unitary matrix, has the same norm as
does $A$, whereas in general the sum of the squares of the diagonal
elements is not the same. Hence among all unitary transforms of a
Hermitian matrix $A$, the transform (4.11.1) maximizes the sum of the squares
of the diagonal elements or minimizes the sum of the squares of the
moduli of the off-diagonal elements.

4.111. The largest and smallest proper values. If $A$ is non-negative
semidefinite, then for $\nu > 0$

             $\lambda_1^\nu \le \operatorname{tr}(A^\nu) = \sum \lambda_i^\nu \le n\lambda_1^\nu,$
(4.111.1)    $\lambda_1 \le [\operatorname{tr}(A^\nu)]^{1/\nu} \le n^{1/\nu}\lambda_1.$

But as $\nu$ increases, $n^{1/\nu} \to 1$. Hence

(4.111.2)    $[n^{-1}\operatorname{tr}(A^\nu)]^{1/\nu} \le \lambda_1 \le [\operatorname{tr}(A^\nu)]^{1/\nu},$

and

(4.111.3)    $\lim_{\nu \to \infty} [n^{-1}\operatorname{tr}(A^\nu)]^{1/\nu} = \lim_{\nu \to \infty} [\operatorname{tr}(A^\nu)]^{1/\nu} = \lambda_1.$

The two sequences obtained from the successive powers $A^\nu$ of $A$ approach
$\lambda_1$ from above and from below. The most effective application of this
algorithm is made by successively squaring the matrix $A$ and forming

    $A, A^2, A^4, A^8, \ldots,$

with $\nu$ taking on the values $2^p$. The degree of convergence is measured
by the ratio $n^{1/\nu} : 1$ or by $n^{2^{-p}}$.
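
The bracketing (4.111.2) by repeated squaring can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the matrix is a real symmetric positive definite example with proper values 3 and 1, and no rescaling is applied, so the powers would overflow for large matrices or many squarings.

```python
def matmul(x, y):
    """Product of two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace_bounds(a, squarings):
    """Lower and upper bounds (4.111.2) on lambda_1 for nu = 2, 4, 8, ..."""
    n = len(a)
    power, nu, bounds = a, 1, []
    for _ in range(squarings):
        power = matmul(power, power)          # A^(2 nu) from A^nu
        nu *= 2
        t = sum(power[i][i] for i in range(n))
        bounds.append(((t / n) ** (1.0 / nu), t ** (1.0 / nu)))
    return bounds


a = [[2.0, 1.0], [1.0, 2.0]]                  # proper values 3 and 1
bounds = trace_bounds(a, 5)                   # nu = 2, 4, 8, 16, 32
```

Each pair in `bounds` brackets $\lambda_1 = 3$, and the gap between the lower and upper bound shrinks as $\nu$ doubles.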
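If $x = Uy$ is any vector, then

(4.111.4)    $A^\nu x = U\Lambda^\nu U^* x = U\Lambda^\nu y = \sum \lambda_i^\nu \eta_i u_i,$

where the $\eta_i$ are the elements of $y$.
If $\lambda_1 > \lambda_i \ge 0$ for $i = 2, 3, \ldots, n$, or if $\lambda_1 > \lambda_2$ and $\lambda_1 > |\lambda_n|$, then
since

    $A^\nu x = \lambda_1^\nu [\eta_1 u_1 + (\lambda_2/\lambda_1)^\nu \eta_2 u_2 + \cdots + (\lambda_n/\lambda_1)^\nu \eta_n u_n],$

as $\nu$ increases all terms but the first within the brackets approach zero,
and in the limit,

(4.111.5)    $A^\nu x \approx \lambda_1^\nu \eta_1 u_1,$

provided only $\eta_1 \ne 0$. That is to say, as $\nu$ increases, the vector $A^\nu x$
approaches a vector in the direction of the first proper vector $u_1$. It is
necessary only to normalize to obtain $u_1$ itself.

If $\eta_1 = 0 \ne \eta_2$, the same argument shows that $A^\nu x$ approaches a
vector in the direction of $u_2$, provided $\lambda_2$ exceeds $\lambda_3, \ldots, \lambda_n$ numerically.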

To square a matrix of order $n$ requires $n^3$ multiplications, whereas to
multiply a matrix by a vector requires only $n^2$. If

(4.111.6)    $x_\nu = A^\nu x = A x_{\nu-1},$

then for large $\nu$ it appears from (4.111.5) that $x_{\nu+1} \approx \lambda_1 x_\nu$. Hence,
although if $\nu = 2^p$, $\nu$ increases rapidly with $p$, it may be more
advantageous to form the sequence $x_\nu$ directly as in (4.111.6) than to square
the matrix several times and then multiply by $x$. Moreover, a blunder
made in computing $x_\nu$ will be corrected in the course of subsequent
multiplications by $A$. The two methods are related essentially as are
Graeffe's and Bernoulli's methods for solving ordinary equations. It
might be pointed out further that, if by a rare chance it should happen
that $\eta_1 = 0$, nevertheless round-off will introduce a component along
$u_1$ in the $x_\nu$, and this component will build up eventually, though perhaps
slowly.
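
The direct formation of the sequence (4.111.6) can be sketched in Python. This is a minimal illustration, not a definitive routine: the names are hypothetical, the matrix is assumed real and symmetric, and each product is rescaled to unit length, a convenience introduced here to keep the numbers in range.

```python
import math


def matvec(a, x):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def power_iteration(a, x, steps):
    """Repeat x <- A x, rescaling to unit length after every product.

    Returns (estimate of lambda_1, unit vector near u_1).
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    x = [xi / norm for xi in x]
    estimate = 0.0
    for _ in range(steps):
        y = matvec(a, x)
        # With x of unit length, x*Ax lies in the field of values of A.
        estimate = sum(xi * yi for xi, yi in zip(x, y))
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return estimate, x


a = [[2.0, 1.0], [1.0, 2.0]]              # proper values 3 and 1
lam1, u1 = power_iteration(a, [1.0, 0.0], 40)
```

The starting vector (1, 0) has a nonzero component along $u_1$, so the iterates turn toward $(1, 1)/\sqrt{2}$ and the estimate toward 3.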
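Let

(4.111.7)    $\alpha_p = x_\nu^* x_{p-\nu} = y_\nu^* y_{p-\nu}, \quad y_\nu = U^* x_\nu.$

Then $\alpha_p$ is independent of $\nu$, and in particular

    $\alpha_p = x^* A^p x = \sum \lambda_i^p \bar{\eta}_i \eta_i.$

Hence, if $\eta_1 \ne 0$,

             $\alpha_{p+1} = \sum \lambda_i^{p+1} \bar{\eta}_i \eta_i \le \lambda_1 \sum \lambda_i^p \bar{\eta}_i \eta_i = \lambda_1 \alpha_p,$
(4.111.8)    $\alpha_{p+1}/\alpha_p \le \lambda_1,$

and

(4.111.9)    $\alpha_{p+1}/\alpha_p \to \lambda_1.$

Thus $\alpha_{p+1}/\alpha_p$ approaches $\lambda_1$ from below.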
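
The approach from below can be seen in a small computation. The sketch assumes a real symmetric non-negative matrix, so that $x^*$ is simply the transpose of $x$; the function names and the example are illustrative only.

```python
def matvec(a, x):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def alpha_sequence(a, x, count):
    """Return alpha_p = x* A^p x for p = 0, ..., count - 1 (real case)."""
    alphas, xp = [], list(x)
    for _ in range(count):
        alphas.append(sum(xi * xpi for xi, xpi in zip(x, xp)))
        xp = matvec(a, xp)                 # x_{p+1} = A x_p
    return alphas


a = [[2, 1], [1, 2]]                       # proper values 3 and 1
alphas = alpha_sequence(a, [1, 0], 11)     # 1, 2, 5, 14, 41, ...
ratios = [alphas[p + 1] / alphas[p] for p in range(10)]
```

For this example $\alpha_p = (3^p + 1)/2$, and the ratios 2, 2.5, 2.8, ... increase steadily toward $\lambda_1 = 3$ without exceeding it.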

Since in the limit $x_{\nu+1} \approx \lambda_1 x_\nu$, the ratio of any element of $x_{\nu+1}$ to the
corresponding element of $x_\nu$ provides also an estimate of $\lambda_1$, when $\nu$ is
sufficiently large. The agreement among these $n$ ratios is an indication
of the nearness to the limit.

As $\nu$ increases, the elements of $x_\nu$ and of $A^\nu$ become large if $\lambda_1 > 1$, or
small if $\lambda_1 < 1$. Hence in actual computation it is convenient to modify
the sequence of $x_\nu$ or the sequence of $A^\nu$ by introducing factors $\kappa_\nu$ to hold
the quantities within range. Thus one could compute the series



where each $\kappa_\nu$ may be selected according to convenience. For
successively squaring the matrix, Bargmann, Montgomery, and von Neumann
propose the sequence

    $B_0 = \rho A/\operatorname{tr}(A),$



where $\rho$ is a scalar slightly less than unity. For $\rho = 1$, all quantities
would be not greater than unity in principle, though this might fail as a
result of round-off. By a suitable choice of $\rho$, one can be sure of staying
within the range from $-1$ to $+1$ even with round-off.

If $\mu \ge \lambda_1$, then $A' = \mu I - A$ is Hermitian, non-negative semidefinite, and has
the greatest proper value $\lambda_1' = \mu - \lambda_n$. Hence no special discussion is
needed for finding the smallest proper value.

4.112. Accelerating convergence. Of the schemes for accelerating
convergence, the simplest is Aitken's $\delta^2$ process, described in Chap. 3. This
can be applied to the sequence $\alpha_{p+1}/\alpha_p$ for finding $\lambda_1$, to the sequence of
ratios of corresponding elements of $x_{\nu+1}$ and $x_\nu$, and to the sequence $x_\nu$ for
the proper vector.
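
A brief sketch of the $\delta^2$ process applied to the ratios $\alpha_{p+1}/\alpha_p$ follows. The extrapolation is written in its usual three-term form, which is assumed here to agree with the form given in Chap. 3; the function name and the example sequence are illustrative.

```python
def aitken(s0, s1, s2):
    """Aitken's delta-squared extrapolation from three successive terms."""
    d1, d2 = s1 - s0, s2 - s1
    return s2 - d2 * d2 / (d2 - d1)


# Ratios alpha_{p+1}/alpha_p for A = [[2, 1], [1, 2]] and x = (1, 0),
# for which alpha_p = (3**p + 1)/2 and the limit is lambda_1 = 3.
ratios = [(3 ** (p + 1) + 1) / (3 ** p + 1) for p in range(8)]
accelerated = [aitken(*ratios[k:k + 3]) for k in range(len(ratios) - 2)]
```

From the three ratios ending at $p = 5$, whose last member is still in error by about 0.008, the extrapolated value is in error by about 0.0003.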

In the sequence $x_\nu$, the rapidity of the approach to the limit depends
upon the smallness of all ratios $\lambda_i/\lambda_1$ for $i > 1$. Any matrix $A'$ which
reduces the greatest of these ratios will provide more rapid convergence.
One of the simplest choices might be $A^2$ or $A^4$. This would mean taking
one or two matrix products, requiring $n^3$ multiplications each, and
following this with a sequence of products of a matrix by a vector, requiring $n^2$
multiplications each. Having obtained, say, the vector $x_{2\nu}$, one could
then obtain $A x_{2\nu} = x_{2\nu+1}$ to find $\lambda_1$ without a root extraction.

Another possibility is to replace $A$ by

    $A' = A + \mu I,$

and hence the proper values $\lambda_i$ by

    $\lambda_i' = \lambda_i + \mu,$

where $\mu$ is judiciously selected.

Assuming, as always, the ordering (4.11.2), and for present purposes
that the strict inequality $\lambda_1 > \lambda_2$ holds, the best choice of $\mu$ will be that
for which the greatest of the ratios $|\lambda_i'/\lambda_1'|$, $i > 1$, is least. But the
greatest of these is either $|\lambda_2'/\lambda_1'|$ or $|\lambda_n'/\lambda_1'|$ or both. The optimal $\mu$ is
therefore that for which these ratios are equal, since a different selection
of $\mu$ would increase one of these ratios even though it decreases the other.
Hence the optimal $\mu$ is

    $\mu = -(\lambda_2 + \lambda_n)/2.$

To make the strictly optimal choice, one must know $\lambda_2$ and $\lambda_n$ exactly, but
enough information may be at hand to permit a good choice.
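
The effect of the shift on the governing ratio can be seen numerically. In the sketch below the proper values are simply assumed known, which of course they are not in practice; the function names and the values 10, 9, 1 are illustrative.

```python
def optimal_shift(lam2, lam_n):
    """Shift mu making |lam2 + mu| and |lam_n + mu| equal."""
    return -(lam2 + lam_n) / 2.0


def worst_ratio(lams, mu):
    """Greatest |lambda_i + mu| / |lambda_1 + mu| over i > 1.

    lams must be listed in the decreasing order (4.11.2).
    """
    lead = abs(lams[0] + mu)
    return max(abs(lam + mu) for lam in lams[1:]) / lead


lams = [10.0, 9.0, 1.0]
mu = optimal_shift(lams[1], lams[-1])     # -5.0
unshifted = worst_ratio(lams, 0.0)        # 9/10
shifted = worst_ratio(lams, mu)           # 4/5
```

Here the shift lowers the governing ratio from 0.9 to 0.8, and moving $\mu$ by half a unit in either direction raises it again.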

The iteration of the linear polynomial $A + \mu I$ and of the very special
quadratic polynomial $A^2$ in place of $A$ is sometimes advantageous. The
question arises then whether $A^2$, or even an $A^2 + \mu I$, for some $\mu$ is
necessarily the best quadratic polynomial, and more generally what is
the best polynomial of any given degree. It turns out that the best
polynomial is given by the Chebyshev polynomial of the prescribed
degree (cf. §5.12).

If a $\lambda'$ and a $\lambda''$ are known such that $\lambda_1 > \lambda' \ge \lambda_2 \ge \cdots \ge \lambda_n \ge \lambda''$,
then it is no restriction to suppose that $\lambda' = -\lambda'' = 1$. For if this is not
the case, one can replace $A$ by $A' = (\lambda' - \lambda'')^{-1}[2A - (\lambda' + \lambda'')I]$.
Hence assume this to have been done, and assume further that a $\delta$ is
known such that

(4.112.1)    $\lambda_1 \ge \delta > 1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge -1.$

Then let

    $T_m(\lambda) = \cos [m \arccos \lambda],$
    $S_m(\lambda) = T_m(\lambda)/T_m(\delta).$

Then $S_m(\delta) = 1$, and $S_m(\lambda_1) \ge 1$. On the other hand, for any $\lambda$
satisfying $1 \ge \lambda \ge -1$ and in particular for $\lambda = \lambda_i$, $i > 1$, it is true that
$|S_m(\lambda)| < 1$. Indeed, the argument developed in §5.12 can be modified
to show that, of all polynomials $q(\lambda)$ of degree $m$ satisfying $q(\delta) = 1$,
$S_m(\lambda)$ is that polynomial whose maximal absolute value on the interval
from $-1$ to $+1$ is least. Hence among all polynomials of degree $m$
that might be used for the iteration, $S_m(A)$ is the best choice that can
be made on the basis of the information contained in the hypothesis. In
other words, in the sequence

    $x_0' = x, \quad x_\nu' = S_m(A) x_{\nu-1}',$

the components along $u_2, \ldots, u_n$ damp out as rapidly as one can make
them by applying a polynomial matrix in $A$ of degree $m$. As a final check,
and to nullify the effect of any accumulated round-off, one may take any
$x_\nu'$ as $x_0$ and apply the matrix $A$ itself one or more times.
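
The damping achieved by $S_m$ can be compared with that of the plain power $(\lambda/\delta)^m$, which also takes the value 1 at $\delta$. The sketch below is an illustration under assumptions: the values $m = 5$ and $\delta = 1.1$ are chosen for the example, and $T_m$ is evaluated by the standard three-term recurrence $T_{k+1}(\lambda) = 2\lambda T_k(\lambda) - T_{k-1}(\lambda)$, which reproduces $\cos[m \arccos \lambda]$ on the interval and continues it beyond.

```python
def chebyshev(m, lam):
    """T_m(lam) by the recurrence T_{k+1} = 2 lam T_k - T_{k-1}."""
    t_prev, t = 1.0, lam
    if m == 0:
        return t_prev
    for _ in range(m - 1):
        t_prev, t = t, 2.0 * lam * t - t_prev
    return t


def damping(m, lam, delta):
    """S_m(lam) = T_m(lam) / T_m(delta)."""
    return chebyshev(m, lam) / chebyshev(m, delta)


m, delta = 5, 1.1
grid = [-1.0 + 2.0 * k / 200 for k in range(201)]   # points of [-1, 1]
worst_chebyshev = max(abs(damping(m, lam, delta)) for lam in grid)
worst_power = max(abs(lam / delta) ** m for lam in grid)
```

For these values $T_5(1.1) = 4.64816$, so that $S_5$ reduces every component with proper value in $[-1, 1]$ by a factor of about 0.215 or better, whereas the fifth power achieves only about 0.621 in the worst case.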

The notion of using a Chebyshev polynomial may be carried a step
further. The vector $x'_\nu$ is the result of $\nu$ applications of the polynomial
$S_m(A)$ of degree $m$ and hence is the result of applying a polynomial
$[S_m(A)]^\nu$ of degree $m\nu$. Clearly the direct application of the polynomial
$S_{m\nu}$ would give better results.

Hence, let us return to the original sequence

$x_0, x_1, x_2, \ldots, x_\nu$.

This is the sequence

$x, Ax, A^2x, \ldots, A^\nu x$.

Instead of accepting $x_\nu$ itself, for some $\nu$, as giving the best approximation
for the direction of $u_1$, one could ask for a linear combination of all $\nu + 1$
vectors that will give as good an approximation as possible. With the
same hypothesis (4.112.1), the best available linear combination is

$x' = \sigma_0 x_0 + \sigma_1 x_1 + \cdots + \sigma_\nu x_\nu$,

where

$S_\nu(\lambda) = \sigma_0 + \sigma_1\lambda + \cdots + \sigma_\nu\lambda^\nu$.

Again all elements of $Ax'$ should be in the approximate ratio $\lambda_1$ to corresponding
elements of $x'$, which fact serves both as a check and as a
means of obtaining $\lambda_1$ directly without the extraction of a root.

4.113. Intermediate proper values. Already attention has been drawn
to the resemblance between the iterative methods for finding proper
values and the methods of Graeffe and of Bernoulli for solving equations.
Indeed, Aitken bases his discussion of the method of Bernoulli for algebraic
equations upon the iteration of a matrix for which the given equation
is the characteristic equation. In principle both methods, Graeffe's
and Bernoulli's, yield all the roots of an algebraic equation.

In 4.21 a particular direct method will be described for finding the
coefficients of the characteristic equation. That method does not itself
yield the roots of this equation. However, if the method is applied to
$A^\nu$ for any $\nu$, then one obtains the equation whose roots are the $\lambda_i^\nu$.
Hence if $\nu$ is sufficiently large, the coefficients of the equation are approximately
equal to $\lambda_1^\nu$, $\lambda_1^\nu\lambda_2^\nu$, $\lambda_1^\nu\lambda_2^\nu\lambda_3^\nu$, . . . , if the $\lambda$'s are numbered in order
of decreasing magnitude. If the matrix is Hermitian, then all roots are
real. For the treatment of multiple roots, reference is to be made to the
discussion of Graeffe's method in Chap. 3. Unfortunately this method
does not yield the associated proper vectors.

For obtaining proper vectors as well as proper values, suppose $\lambda_i > \lambda_{i+1}$.
Take any $\mu$ such that $0 < \lambda_i - \mu < \mu - \lambda_{i+1}$. Then the numerically
smallest proper value of $A - \mu I$ is $\lambda_i - \mu$. Also $(\lambda_i - \mu)^2$ is the smallest
proper value of the positive definite matrix $(A - \mu I)^2$. Hence the
problem of evaluating the proper value $\lambda_i$ and its associated proper vector
is reduced to that of evaluating the smallest proper value of the positive
definite matrix $(A - \mu I)^2$, and this in turn can be reduced to that of
finding the largest proper value of a related matrix, as described in 4.112.
If $\lambda_i - \mu > \mu - \lambda_{i+1} > 0$, then $(\lambda_{i+1} - \mu)^2$ is the smallest proper value
of $(A - \mu I)^2$.

If $\mu$ is any number on the interval from $\lambda_1$ to $\lambda_n$ but sufficiently far
from either $\lambda_1$ or $\lambda_n$, then $(A - \mu I)^2$ will have $(\lambda_i - \mu)^2$ as its smallest
proper value for some $i = 2, \ldots, n - 1$. Hence any proper value and
vector can be obtained independently of all others and so that no errors
present in one will affect the others.

A more common procedure for obtaining the intermediate proper values
is to obtain them in sequence, making each depend upon the larger values
already found and the vectors associated with them. The vectors $u_i$
are mutually orthogonal and of unit length, which means in the complex
case that

$u_i^* u_j = \delta_{ij}$.

The case of a real symmetric matrix $A$ with real proper vectors $u_i$ is a
special case. If the trial vector $x$ has no component in the direction of
$u_1$, and if $\lambda_2 > |\lambda_i|$ for $i = 3, 4, \ldots, n$, then $x_\nu$ approaches $u_2$ in direction,
and $\alpha_{\nu+1}/\alpha_\nu$, as well as the ratio of any element of $x_{\nu+1}$ to the corresponding
element of $x_\nu$, approaches $\lambda_2$. But if $u_1$ is known, and $x'$ is
any vector, then

(4.113.1) $x = x' - u_1 u_1^* x'$

is orthogonal to $u_1$. In fact, this is merely the vector which remains
when the orthogonal projection of $x'$ upon the unit vector $u_1$ is subtracted
from $x'$. Hence when $u_1$ has been found, a new sequence $x_\nu$ can be
determined beginning with a vector $x$ orthogonal to $u_1$. All vectors $x_\nu$
will be orthogonal to $u_1$, except when round-off introduces spurious components
along $u_1$, but these can be removed from time to time by applying
(4.113.1). Hence $\lambda_2$ and $u_2$ can be obtained from this sequence of
vectors, just as were $\lambda_1$ and $u_1$ before.
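The procedure just described can be sketched for a real symmetric matrix as follows. This is an illustration under assumptions, not a prescription from the text: the projection (4.113.1) is applied at every step rather than from time to time, the proper value is estimated by the quotient $x^T A x$ for the unit vector $x$ instead of by ratios of elements, and the starting vector is assumed to have a component along the vector sought. All function names are illustrative.

```python
import math


def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(x):
    r = math.sqrt(dot(x, x))
    return [t / r for t in x]


def project_out(x, known):
    """Apply (4.113.1) once for each known unit proper vector."""
    for u in known:
        c = dot(u, x)
        x = [t - c * s for t, s in zip(x, u)]
    return x


def power_iteration(A, x0, known=(), steps=200):
    """Largest proper value (in magnitude) among those not yet found."""
    x = unit(project_out(list(x0), known))
    for _ in range(steps):
        x = unit(project_out(matvec(A, x), known))
    return dot(x, matvec(A, x)), x


# Proper values of this matrix are 2 + sqrt(2), 2, and 2 - sqrt(2).
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
lam1, u1 = power_iteration(A, [1.0, 0.0, 0.0])
lam2, u2 = power_iteration(A, [1.0, 0.0, 0.0], known=[u1])
```

The second call starts from the same trial vector; the projection removes its component along `u1`, so the iteration settles on the second proper value.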

In general, for evaluating any $\lambda_i$ and $u_i$, where $\lambda_i$ exceeds in absolute
value every proper value not already found, one begins with a vector $x$
from which has been subtracted the orthogonal projection upon the subspace
determined by those proper vectors already found.

Another method for finding the intermediate proper values and their
associated vectors is to replace the matrix $A$ by another having $\lambda_2, \lambda_3,
\ldots, \lambda_n$, but not $\lambda_1$, as proper values. The matrix

$A_1 \equiv A - \lambda_1 u_1 u_1^*$

is Hermitian, and if $i \ne 1$, then

$A_1 u_i = A u_i - \lambda_1 u_1 u_1^* u_i = \lambda_i u_i$,

whereas

$A_1 u_1 = A u_1 - \lambda_1 u_1 = 0$.

Hence $A_1$ has the proper values 0, associated with the proper vector $u_1$,
and $\lambda_i$, $i > 1$, associated with the proper vector $u_i$. Thus $A_1$ satisfies the
requirements. It is useful to note that

$A_1^2 = A^2 - \lambda_1 A u_1 u_1^* - \lambda_1 u_1 u_1^* A + \lambda_1^2 u_1 u_1^* = A^2 - \lambda_1^2 u_1 u_1^*$,

and hence inductively

(4.113.2) $A_1^\nu = A^\nu - \lambda_1^\nu u_1 u_1^*$.

Thus if the powers of $A$ have been formed in the process of arriving at $\lambda_1$
and $u_1$, to form the same powers of $A_1$ it is necessary only to subtract a
scalar multiple of the singular matrix $u_1 u_1^*$.
When $\lambda_2$ and $u_2$ are found, one can form

$A_2 = A_1 - \lambda_2 u_2 u_2^*$

and proceed to find $\lambda_3$ and $u_3$.
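The effect of subtracting $\lambda_1 u_1 u_1^*$ can be checked on a small case. The sketch below assumes a real symmetric matrix, so that $u_1^*$ is simply the transpose, and uses a $2 \times 2$ example whose proper values and vectors are known exactly; the name `deflate` is illustrative.

```python
import math


def deflate(A, lam, u):
    """Return A - lam * u u^T for a real symmetric A and a unit vector u."""
    n = len(A)
    return [[A[i][j] - lam * u[i] * u[j] for j in range(n)] for i in range(n)]


def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]


A = [[2.0, 1.0], [1.0, 2.0]]      # proper values 3 and 1
r = 1.0 / math.sqrt(2.0)
u1, u2 = [r, r], [r, -r]          # the associated unit proper vectors
A1 = deflate(A, 3.0, u1)

zero = matvec(A1, u1)             # A1 u1 should vanish
same = matvec(A1, u2)             # A1 u2 should equal 1 * u2
```

Here `A1` has the proper value 0 in place of 3 and retains the proper value 1 with the same proper vector.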

In this method the matrices $A_1, A_2, \ldots$ are all of order $n$, with 0
as a multiple proper value replacing each of the proper values already
found. It is possible, however, to reduce the order of the matrices.
Thus when $\lambda_1$ and $u_1$ are found, let $u'_2$ be any unit vector orthogonal to $u_1$;
$u'_3$ any unit vector orthogonal to both $u_1$ and $u'_2$; . . . ; and finally $u'_n$
one of the two unit vectors orthogonal to $u_1, u'_2, \ldots, u'_{n-1}$. Then the
matrix

$U_1 = (u_1, u'_2, \ldots, u'_n)$

is unitary, and one verifies that

(4.113.3) $U_1^* A U_1 = \begin{pmatrix} \lambda_1 & 0 \\ 0 & A_1 \end{pmatrix}$,

where now $A_1$ is of order $n - 1$ and Hermitian. The matrix $U_1^* A U_1$ has
the same proper values as $A$; hence the proper values of $A_1$ are $\lambda_2,
\ldots, \lambda_n$. Let $v$ be any proper vector of $A_1$:

$A_1 v = \lambda v$.

Then

$\begin{pmatrix} \lambda_1 & 0 \\ 0 & A_1 \end{pmatrix} \begin{pmatrix} 0 \\ v \end{pmatrix} = \lambda \begin{pmatrix} 0 \\ v \end{pmatrix}$.

Hence

$A U_1 \begin{pmatrix} 0 \\ v \end{pmatrix} = \lambda U_1 \begin{pmatrix} 0 \\ v \end{pmatrix}$,

and $U_1 \begin{pmatrix} 0 \\ v \end{pmatrix}$ is a proper vector of $A$. Hence if the largest proper value of
$A_1$ and its associated proper vector are found, the corresponding proper
vector of $A$ can be found directly. The next step is to replace $A_1$ by a
matrix of the form

$\begin{pmatrix} \lambda_2 & 0 \\ 0 & A_2 \end{pmatrix}$,

where $A_2$ is of order $n - 2$.

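A simple construction of the matrix $U_1$ is given by Feller and Forsythe.
Write $u_1$ in the form

(4.113.4) $u_1 = \begin{pmatrix} \omega \\ w \end{pmatrix}$,

where $w$ is a vector of $n - 1$ elements and $\omega$ a scalar. Then it is possible
to choose $\mu$ so that

(4.113.5) $U_1 = \begin{pmatrix} \omega & -w^* \\ w & I - \mu w w^* \end{pmatrix}$

is unitary. For this we must have

$\begin{pmatrix} \bar\omega & w^* \\ -w & I - \bar\mu w w^* \end{pmatrix} \begin{pmatrix} \omega & -w^* \\ w & I - \mu w w^* \end{pmatrix} = I$,

and hence by direct calculation, using $w^* w = 1 - \omega\bar\omega$,

(4.113.6) $\mu = (1 - \bar\omega)/(1 - \omega\bar\omega)$.

Hence when $\lambda_1$ and $u_1$ are found, the transformation (4.113.3), where the
unitary matrix $U_1$ is defined by (4.113.5), (4.113.4), and (4.113.6),
replaces $A$ by a Hermitian matrix $A_1$ of lower order, whose proper values
are the same as those of $A$ which are not yet known, and whose proper
vectors $v$ yield those of $A$ by the simple relation

(4.113.7) $u = U_1 \begin{pmatrix} 0 \\ v \end{pmatrix}$.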
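A sketch of this reduction for the real symmetric case may make the construction concrete. It is written under assumptions that go beyond the text: all quantities are real, so that $\mu = (1 - \omega)/(1 - \omega^2) = 1/(1 + \omega)$, which requires $\omega \ne -1$; the unitary matrix is taken with first column $u_1$, first row $(\omega, -w^T)$, and lower right block $I - \mu w w^T$; and the names `reduction_matrix` and `congruence` are illustrative.

```python
import math


def reduction_matrix(u):
    """Orthogonal matrix with first column u = (omega, w), omega != -1."""
    omega, w = u[0], u[1:]
    mu = 1.0 / (1.0 + omega)          # real form of (1 - omega)/(1 - omega^2)
    n = len(u)
    U = [[0.0] * n for _ in range(n)]
    U[0][0] = omega
    for i, wi in enumerate(w, start=1):
        U[0][i] = -wi
        U[i][0] = wi
        for j, wj in enumerate(w, start=1):
            U[i][j] = (1.0 if i == j else 0.0) - mu * wi * wj
    return U


def congruence(U, A):
    """Return U^T A U."""
    n = len(A)
    AU = [[sum(A[i][k] * U[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(U[k][i] * AU[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
lam1 = 2.0 + math.sqrt(2.0)                     # largest proper value of A
u1 = [0.5, math.sqrt(2.0) / 2.0, 0.5]           # its unit proper vector
B = congruence(reduction_matrix(u1), A)         # (lam1, 0; 0, A1)
A1 = [row[1:] for row in B[1:]]                 # reduced matrix of order 2
```

The first row and column of `B` vanish apart from the leading element, and the remaining $2 \times 2$ block `A1` carries the two proper values of $A$ not yet found.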

4.114. Equal and nearly equal roots. Suppose $\lambda_1 = \lambda_2 > \lambda_3$. Then
associated with $\lambda_1$ is a two-dimensional set of proper vectors, and any two
mutually orthogonal unit vectors in this set can be taken as $u_1$ and $u_2$.
Given any starting vector $x$, let $u_1$ be the direction of its projection on
this set. Then $x$ can be written in the form

$x = \eta_1 u_1 + \eta_3 u_3 + \cdots + \eta_n u_n$,

while

$x_\nu = A^\nu x = \eta_1\lambda_1^\nu u_1 + \eta_3\lambda_3^\nu u_3 + \cdots + \eta_n\lambda_n^\nu u_n$,

and for $\nu$ sufficiently large

$x_\nu \doteq \eta_1\lambda_1^\nu u_1$.

If $z$ is any other vector, then

$z = \omega_1 u_1 + \omega_2 u_2 + \omega_3 u_3 + \cdots + \omega_n u_n$,

where in general $\omega_2 \ne 0$. Hence for large $\nu$

$z_\nu = A^\nu z \doteq (\omega_1 u_1 + \omega_2 u_2)\lambda_1^\nu$.

Both vectors, $z$ and $x$, will be effective in yielding $\lambda_1$, but in general
distinct vectors will lead to distinct proper vectors associated with the
same proper value. However, a third vector $w$ will yield a $w_\nu$ for large $\nu$
such that $x_\nu$, $z_\nu$, and $w_\nu$ are linearly dependent.

On the other hand, if $\lambda_1$ is a triple root, three starting vectors will
lead to the same $\lambda_1$ but to three independent proper vectors, while four
starting vectors would approach a set of four linearly dependent vectors.

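If $\lambda_1$ and $\lambda_2$ are nearly, but not exactly, equal, then with each will be
associated a unique proper vector. Consider two starting vectors $x$ and
$z$. For $\nu$ sufficiently large, we shall have approximately

$x_\nu \doteq \eta_1\lambda_1^\nu u_1 + \eta_2\lambda_2^\nu u_2$,

$z_\nu \doteq \omega_1\lambda_1^\nu u_1 + \omega_2\lambda_2^\nu u_2$.

Let

$\alpha_\nu = x^* x_\nu, \qquad \beta_\nu = z^* z_\nu$.

Then

(4.114.1) $\alpha_\nu \doteq \bar\eta_1\eta_1\lambda_1^\nu + \bar\eta_2\eta_2\lambda_2^\nu$,
$\beta_\nu \doteq \bar\omega_1\omega_1\lambda_1^\nu + \bar\omega_2\omega_2\lambda_2^\nu$.

Then the two matrices

$\begin{pmatrix} 1 & \alpha_\nu & \beta_\nu \\ \lambda_i & \alpha_{\nu+1} & \beta_{\nu+1} \\ \lambda_i^2 & \alpha_{\nu+2} & \beta_{\nu+2} \end{pmatrix}, \qquad i = 1, 2$,

are both singular, and hence $\lambda_1$ and $\lambda_2$ must satisfy the equation

(4.114.2) $\begin{vmatrix} 1 & \alpha_\nu & \beta_\nu \\ \lambda & \alpha_{\nu+1} & \beta_{\nu+1} \\ \lambda^2 & \alpha_{\nu+2} & \beta_{\nu+2} \end{vmatrix} = 0$.

The method can be extended to the case when three or more of the proper
values are nearly equal.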

When the powers $A^\nu$ are formed explicitly, there are already at hand
the vectors $x_\nu$ of $n$ distinct sequences since the $i$th column of $A^\nu$ is $A^\nu e_i$.
The $i$th diagonal element of $A^\nu$ is the $\alpha_\nu$ for the starting vector $e_i$.

If the matrices $A^\nu$ of the sequence approach rank 1, then all columns
approach the proper vector associated with the single largest proper value.
If they approach rank 2, then the two largest proper values are equal or
nearly so. Pick out any two diagonal elements and consider their values
in consecutive matrices of the series. Let these be $\alpha_\nu$ and $\beta_\nu$ in $A^\nu$,
and $\alpha_{\nu+1}$ and $\beta_{\nu+1}$ in $A^{\nu+1}$. From (4.114.1), the matrix

$\begin{pmatrix} \alpha_\nu & \beta_\nu \\ \alpha_{\nu+1} & \beta_{\nu+1} \end{pmatrix}$

has a determinant approximately equal to
$(\bar\eta_1\eta_1\bar\omega_2\omega_2 - \bar\eta_2\eta_2\bar\omega_1\omega_1)\lambda_1^\nu\lambda_2^\nu(\lambda_2 - \lambda_1)$.
Hence if the determinant of this matrix vanishes for large $\nu$, then $\lambda_1 = \lambda_2$.
If it does not vanish, the roots are nearly equal, by comparison with the
other proper values, but not exactly equal, and they can be found by
solving Eq. (4.114.2).

4.115. Rotational reduction to diagonal form. A second-order Hermitian
matrix has the form

(4.115.1) $B = \begin{pmatrix} \beta_1 & \beta e^{i\phi} \\ \beta e^{-i\phi} & \beta_2 \end{pmatrix}, \qquad \beta \ge 0$,

and its characteristic equation is

(4.115.2) $\mu^2 - (\beta_1 + \beta_2)\mu + \beta_1\beta_2 - \beta^2 = 0$.

The discriminant of this quadratic is

$(\beta_1 - \beta_2)^2 + 4\beta^2$,

and since all the $\beta$'s are real, this can vanish only in case $\beta_1 - \beta_2 = \beta = 0$.
Hence the only second-order Hermitian matrices with coincident proper
values are scalar matrices, which are necessarily diagonal.

In complex 2 space, a unit vector can be written in the form

$v = \begin{pmatrix} e^{i\omega_1}\cos\theta \\ e^{i\omega_2}\sin\theta \end{pmatrix}$.

Assume $\beta > 0$, and let $\mu_1$ and $\mu_2 < \mu_1$ be the two roots of (4.115.2). Then
if $v$ is a proper vector associated with $\mu_1$,

$(\beta_1 - \mu_1)e^{i\omega_1}\cos\theta + \beta e^{i(\phi + \omega_2)}\sin\theta = 0$.

This can be satisfied by taking

$\omega_1 = \omega + \phi/2, \qquad \omega_2 = \omega - \phi/2$,

with $\omega$ arbitrary, and

(4.115.3) $\tan\theta = (\mu_1 - \beta_1)/\beta = \beta/(\mu_1 - \beta_2)$.

The second expression is obtained by using the second row of $B$, and the
two expressions are equivalent since (4.115.2) can be written

$(\mu - \beta_1)(\mu - \beta_2) = \beta^2$.

Because of this relation, either root $\mu$ must exceed both $\beta_1$ and $\beta_2$ or be
exceeded by both. But since the sum of the roots is

$\mu_1 + \mu_2 = \beta_1 + \beta_2$,

it follows that the larger root $\mu_1$ must exceed both, and the smaller root $\mu_2$
must be exceeded by both. Hence

$\tan\theta > 0$,

and $\theta$ can be taken to lie in the first quadrant. Hence $\theta$ is uniquely
determined. By applying a standard trigonometric identity, one obtains

(4.115.4) $\tan 2\theta = 2\beta/(\beta_1 - \beta_2)$,

an expression which does not involve $\mu_1$.

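The proper vector $v_1$ associated with $\mu_1$ can now be written

(4.115.5) $v_1 = \begin{pmatrix} e^{i\phi/2}\cos\theta \\ e^{-i\phi/2}\sin\theta \end{pmatrix}$,

where $\phi$ is defined by (4.115.1) and $\theta$ by (4.115.4) with the additional
convention that $0 < 2\theta < \pi$.

If one uses (4.115.4) and then applies trigonometric identities to find
the functions of $\theta$, then from (4.115.3) and the expression for the sum
of the roots it follows that

(4.115.6) $\mu_1 = \beta_1 + \beta\tan\theta, \qquad \mu_2 = \beta_2 - \beta\tan\theta$.

Since $v_2$ is orthogonal to $v_1$, it is easy to obtain

(4.115.7) $v_2 = \begin{pmatrix} -e^{i\phi/2}\sin\theta \\ e^{-i\phi/2}\cos\theta \end{pmatrix}$,

so that the unitary matrix $V$ which diagonalizes $B$ to $M$ is

(4.115.8) $V = \begin{pmatrix} e^{i\phi/2}\cos\theta & -e^{i\phi/2}\sin\theta \\ e^{-i\phi/2}\sin\theta & e^{-i\phi/2}\cos\theta \end{pmatrix}$,

where

(4.115.9) $V^* B V = M = \begin{pmatrix} \mu_1 & 0 \\ 0 & \mu_2 \end{pmatrix}$.

One verifies directly that the sums of squares of the diagonal elements of
$B$ and $M$ are related by

$\mu_1^2 + \mu_2^2 = \beta_1^2 + \beta_2^2 + 2\beta^2$.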

Now suppose $B$ represents any principal minor of $A$. For simplicity
let this be the minor taken from the first two rows and columns of $A$.
The matrix

$U_1 = \begin{pmatrix} V & 0 \\ 0 & I \end{pmatrix}$

is easily seen to be unitary. If we write $A$ in the form

$A = \begin{pmatrix} B & B_1^* \\ B_1 & B_{11} \end{pmatrix}$,

then

$A_1 = U_1^* A U_1 = \begin{pmatrix} M & V^* B_1^* \\ B_1 V & B_{11} \end{pmatrix}$.

Hence in the transform $A_1$ of $A$ by the unitary matrix $U_1$ the sum of the
squares of the diagonal elements is increased by $2\beta^2$. The same would
hold if $B$ were any principal minor with non-null off-diagonal elements
$\alpha_{ij}$ and $\alpha_{ji}$, except that the matrix $U$ must be formed by placing the
elements of $V$ in the $i$th and $j$th rows and columns.

The sum of the squares of the diagonal elements of $A$ is maximal among
all the unitary transforms of $A$ for the diagonal matrix $\Lambda$. Hence an
infinite sequence of transforms $A_\nu$ of $A$, each produced from the preceding
by a plane rotation as just described, and chosen to nullify a pair of
off-diagonal elements of greatest magnitude, will approach $\Lambda$ as a limit,
and the infinite product of unitary matrices $U_\nu$ will approach the matrix
$U$ of proper vectors. The ordering of the proper values down the main
diagonal of $\Lambda$ is taken care of by the convention that $\mu_1$ must be the larger
of the two proper values of each $B$, provided the rows and columns in $B$
are ordered as they are in $A$.
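For a real symmetric matrix the rotational reduction can be sketched in a few lines. The following is an illustration under assumptions rather than the text's own formulation: the matrix is real, so the phase factor drops out and the sign of the off-diagonal element is absorbed by taking $2\theta$ from `math.atan2(2*beta, beta1 - beta2)`; at each step the off-diagonal element of greatest magnitude is annihilated; and the order is at least 2. The name `jacobi_eigen` is illustrative.

```python
import math


def jacobi_eigen(A, tol=1e-12, max_rotations=1000):
    """Return (diagonal of the limit, accumulated rotation matrix U)."""
    n = len(A)                                  # assumes n >= 2
    A = [list(map(float, row)) for row in A]
    U = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_rotations):
        # Off-diagonal element of greatest magnitude.
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        if abs(A[p][q]) < tol:
            break
        # tan(2 theta) = 2 beta / (beta1 - beta2), as in (4.115.4).
        theta = 0.5 * math.atan2(2.0 * A[p][q], A[p][p] - A[q][q])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):                      # A <- A V (columns p, q)
            akp, akq = A[k][p], A[k][q]
            A[k][p], A[k][q] = c * akp + s * akq, -s * akp + c * akq
        for k in range(n):                      # A <- V^T A (rows p, q)
            apk, aqk = A[p][k], A[q][k]
            A[p][k], A[q][k] = c * apk + s * aqk, -s * apk + c * aqk
        for k in range(n):                      # U <- U V
            ukp, ukq = U[k][p], U[k][q]
            U[k][p], U[k][q] = c * ukp + s * ukq, -s * ukp + c * ukq
    return [A[i][i] for i in range(n)], U


A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
values, vectors = jacobi_eigen(A)
```

The columns of `vectors` approximate the proper vectors belonging to the corresponding entries of `values`. Choosing the angle through `atan2` places the larger of the two proper values of each second-order minor in the earlier diagonal position, in keeping with the convention stated above.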

4.12. The Matrix A Is Arbitrary. If the proper values of the nonsymmetric
real matrix or the non-Hermitian complex matrix $A$ are all
distinct, then there exists a nonsingular matrix $W$ such that

(4.12.1) $W^{-1} A W = \Lambda$,

and $\Lambda$ is again a diagonal matrix of the proper values of $A$. In this case,
the matrix $W$ is not orthogonal or unitary. Nevertheless, it is still true
that

(4.12.2) $A^\nu = W \Lambda^\nu W^{-1}$,

where $\nu$ is any integer if $A$ is nonsingular or any non-negative integer when
$A$ is singular. Moreover,

(4.12.3) $p(A) = W p(\Lambda) W^{-1}$,

where $p(\lambda)$ is any polynomial. Hence let $x = Wy$ be any vector. Then

$A^\nu x = W \Lambda^\nu y = \sum \lambda_i^\nu \eta_i w_i$,

where the $\eta_i$ are the elements of $y$ and the $w_i$ the columns of $W$.
If there is a single proper value of largest modulus, let this be $\lambda_1$. Then
just as in the Hermitian case for large $\nu$

(4.12.4) $x_\nu = A^\nu x \doteq \lambda_1^\nu \eta_1 w_1$,

provided only $\eta_1 \ne 0$.

For a non-Hermitian matrix, there are two sets of proper vectors, a set
of row vectors and a set of column vectors. Corresponding to any proper
value $\lambda_i$ there is a column vector $w_i$ and a row vector $w^i$ such that

(4.12.5) $A w_i = \lambda_i w_i, \qquad w^i A = \lambda_i w^i$.

The $w^i$ are in fact the rows of $W^{-1}$, as the $w_i$ are the columns of $W$, or may
be taken so when properly normalized. Let $u = vW^{-1}$ be any row vector.
Then

$uA^\nu = v\Lambda^\nu W^{-1} = \sum \lambda_i^\nu \varphi_i w^i$

if the $\varphi_i$ are the elements of $v$. Hence, again, for sufficiently large $\nu$

(4.12.6) $u_\nu = uA^\nu \doteq \lambda_1^\nu \varphi_1 w^1$,

provided $\varphi_1 \ne 0$. It is not necessary that $w_1$ and $w^1$ be of unit length, but
for these to be a column of $W$ and a row of $W^{-1}$, respectively, it is necessary
that

$w^1 w_1 = 1$.

For sufficiently large $\nu$, consecutive vectors in the sequence $x_\nu$, and
also consecutive vectors in the sequence $u_\nu$, are approximately linearly
dependent. Hence the ratio of any two corresponding components can
be used to provide an approximation to $\lambda_1$, and the agreement among
these ratios provides evidence as to the nearness to the limit. The $\delta^2$
process can be applied to accelerate the convergence to all limits, $\lambda_1$, $w_1$,
and $w^1$.
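The two iterations, on columns and on rows, can be sketched together. The code below is a minimal illustration under assumptions not stated in the text: the matrix is real with a single real proper value of largest modulus, both starting vectors have a component along the dominant proper vector, each iterate is rescaled by its component of largest magnitude, and $\lambda_1$ is read off as a ratio of corresponding components. The names `vecmat`, `scaled`, and `dominant_triple` are illustrative.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def vecmat(u, A):
    """Row vector times matrix."""
    n = len(A)
    return [sum(u[i] * A[i][j] for i in range(n)) for j in range(n)]


def scaled(x):
    big = max(x, key=abs)
    return [t / big for t in x]


def dominant_triple(A, steps=200):
    """Return (lambda_1, column vector w_1, row vector w^1) with w^1 w_1 = 1."""
    n = len(A)
    x, u = [1.0] * n, [1.0] * n
    for _ in range(steps):
        x = scaled(matvec(A, x))      # column sequence x_nu
        u = scaled(vecmat(u, A))      # row sequence u_nu
    k = max(range(n), key=lambda i: abs(x[i]))
    lam = matvec(A, x)[k] / x[k]      # ratio of corresponding components
    s = sum(a * b for a, b in zip(u, x))
    return lam, x, [t / s for t in u]


A = [[4.0, 1.0], [2.0, 3.0]]          # proper values 5 and 2
lam, w_col, w_row = dominant_triple(A)
```

For this matrix the column vector is proportional to $(1, 1)$ and the row vector, after normalization, is $(2/3, 1/3)$.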

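Suppose now that $\lambda_1$ and $\lambda_2$ are equal in modulus, but exceed in modulus
all other proper values. Then in the limit

(4.12.7) $x_\nu \doteq \lambda_1^\nu \eta_1 w_1 + \lambda_2^\nu \eta_2 w_2$,
$u_\nu \doteq \lambda_1^\nu \varphi_1 w^1 + \lambda_2^\nu \varphi_2 w^2$.

Consider the sequence of scalars

(4.12.8) $\alpha_\nu = u x_\nu = u_\nu x$.

For large $\nu$ this becomes

$\alpha_\nu \doteq \lambda_1^\nu \varphi_1\eta_1 + \lambda_2^\nu \varphi_2\eta_2$.

Then in the limit the matrix

$\begin{pmatrix} 1 & 1 & \alpha_\nu \\ \lambda_1 & \lambda_2 & \alpha_{\nu+1} \\ \lambda_1^2 & \lambda_2^2 & \alpha_{\nu+2} \end{pmatrix}$

is singular. If either $u$ or $x$ is replaced by a different vector, another
sequence of scalars $\beta_\nu$ can be defined, and by a familiar argument $\lambda_1$ and $\lambda_2$
must satisfy the quadratic equation

(4.12.9) $\begin{vmatrix} 1 & \alpha_\nu & \beta_\nu \\ \lambda & \alpha_{\nu+1} & \beta_{\nu+1} \\ \lambda^2 & \alpha_{\nu+2} & \beta_{\nu+2} \end{vmatrix} = 0$.

The $\alpha$'s and the $\beta$'s can indeed be individual components in the $x$'s or in
the $u$'s. The extension to a larger number of proper values of equal
modulus is direct, and applies, moreover, also to the case of moduli that
are nearly equal.

In case $\lambda_1 = \lambda_2$, the coefficients in the quadratic (4.12.9) all vanish.
This case will be considered later.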

Suppose the proper value $\lambda_1$ of greatest modulus and its associated
vectors $w_1$ and $w^1$ have been found. Suppose $\lambda_2$ exceeds all other proper
values in modulus. Its proper vector $w_2$ is orthogonal to $w^1$, and $w^2$ is
orthogonal to $w_1$. Hence one can proceed as above but with starting
vectors $x$, orthogonal to $w^1$, and $u$, orthogonal to $w_1$:

$w^1 x = u w_1 = 0$.

Or, one can replace the matrix $A$ by

$A_1 = A - \lambda_1 w_1 w^1$,

whose proper values are $0, \lambda_2, \lambda_3, \ldots, \lambda_n$.

Turning now to the case of multiple values, it may or may not be
possible to diagonalize the matrix when such occur. If $\lambda_1 = \lambda_2 \ne \lambda_i$ for
$i > 2$, and if two independent proper vectors exist associated with $\lambda_1$,
then in general a starting vector $x$ will have some component which is a
linear combination of these vectors. The iteration can proceed and will
yield $\lambda_1$ and some linear combination of $w_1$ and $w_2$. Likewise a starting
vector $u$ will yield $\lambda_1$ and some linear combination of $w^1$ and $w^2$. A different
starting vector $x$ will yield a different linear combination of $w_1$ and
$w_2$, and a different starting vector $u$ a different linear combination of $w^1$
and $w^2$. When powers $A^\nu$ of the matrix are computed explicitly, then
one has the effect of iterating simultaneously upon $n$ distinct column
vectors, the columns of $A$, and upon $n$ distinct row vectors, the rows of $A$.
In the limit if corresponding columns (or rows) in consecutive powers $A^\nu$
are linearly independent, then the largest proper values are equal, or
nearly equal, in modulus. But if these are linearly dependent, whereas
the matrix $A^\nu$ has rank $> 1$, then the largest proper value is a multiple
root.

However, in the case of a multiple proper value, it can happen that 
the number of linearly independent proper vectors associated with this 
proper value is less than the multiplicity. If so, then the matrix cannot 
be diagonalized. The general form was discussed in 2.06, and is shown 
in (2.06.20), (2.06.21), and (2.06.22). 

It is clear that among the iterates $A^\nu x$ of any vector $x$ only a finite
number can be linearly independent. There will be some $m \le n$ such
that for $\nu \ge m$, $A^\nu x$ is expressible as a linear combination of $x, Ax, \ldots,
A^{m-1}x$. Hence all iterates lie in some subspace of dimension $m \le n$.
In this subspace there is at most one (in general there will be exactly one)
proper vector associated with each distinct proper value. The fact can
be deduced from the discussion in 2.06, but will be brought out more
clearly in 4.23. Furthermore, associated with each $\lambda_i$ there is a principal
vector of highest grade in the subspace, say $u_i^{(n_i)}$ of grade $n_i$; the vectors

$u_i^{(n_i - 1)} = (A - \lambda_i I)u_i^{(n_i)}, \qquad u_i^{(n_i - 2)} = (A - \lambda_i I)u_i^{(n_i - 1)}, \qquad \ldots$

are of progressively lower grade, and lie also in the subspace, and in particular
$u_i^{(1)}$ is the unique proper vector in the subspace.

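Consider the effect of the iteration upon these principal vectors. If
we write

$U_i = (u_i^{(1)}, u_i^{(2)}, \ldots, u_i^{(n_i)})$,

then one verifies that

$A U_i = U_i \Lambda_i$,

where

$\Lambda_i = \lambda_i I + I_1 = \begin{pmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{pmatrix}$,

the matrix $\Lambda_i$ being of order $n_i$. Hence

$A^2 U_i = A U_i \Lambda_i = U_i \Lambda_i^2 = U_i(\lambda_i I + I_1)^2$,

and in general

$A^\nu U_i = U_i(\lambda_i I + I_1)^\nu$.

The auxiliary unit matrix $I_1$ vanishes in the $n_i$th and higher powers.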

Associated with each distinct $\lambda_i$ will be a particular matrix $U_i$ of
principal vectors, and $x$ can be expressed as a sum

$x = \sum U_i x^{(i)}$,

where $x^{(i)}$ is a vector of as many elements as there are columns in $U_i$.
Hence

$A^\nu x = \sum U_i(\lambda_i I + I_1)^\nu x^{(i)}$.

If there is a proper value $\lambda_1$ of modulus exceeding all other $|\lambda_i|$, then
in the limit

$x_\nu = A^\nu x \doteq U_1(\lambda_1 I + I_1)^\nu x^{(1)}$.

Since

$(\lambda_1 I + I_1)^{n_1} - n_1\lambda_1(\lambda_1 I + I_1)^{n_1 - 1} + \cdots + (-\lambda_1)^{n_1} I = [(\lambda_1 I + I_1) - \lambda_1 I]^{n_1} = I_1^{n_1} = 0$,

it follows that in the limit any $1 + n_1$ consecutive vectors in the sequence
$A^\nu x$ will be linearly dependent, and in fact the coefficients expressing the
dependence relation are the coefficients of the powers of $\lambda$ in the expansion
of $(\lambda - \lambda_1)^{n_1}$. This fact provides a means for computing $\lambda_1$.
Given $\lambda_1$, it is possible to form the combinations

$\Delta x_\nu = x_{\nu+1} - \lambda_1 x_\nu$,
$\Delta^2 x_\nu = x_{\nu+2} - 2\lambda_1 x_{\nu+1} + \lambda_1^2 x_\nu$,
. . . .

We find

$\Delta x_\nu \doteq U_1 I_1(\lambda_1 I + I_1)^\nu x^{(1)}$,

and more generally

$\Delta^k x_\nu \doteq U_1 I_1^k(\lambda_1 I + I_1)^\nu x^{(1)}$.

But since $I_1^{n_1} = 0$, therefore $\Delta^{n_1} x_\nu = 0$. Moreover, $I_1^{n_1 - 1}$ differs from the
null matrix only in the last element of the first row. Hence in the product
$U_1 I_1^{n_1 - 1}$ only one column is non-null, and this is the proper vector $u_1^{(1)}$.
Hence $\Delta^{n_1 - 1} x_\nu$ is equal to $u_1^{(1)}$ except for a nonessential scalar multiplier.
Thus even in this case, it is possible to obtain from the iteration both the
largest proper value and an associated proper vector.

4.2. Direct Methods. By a direct method will be meant a method for 
obtaining explicitly the characteristic function, the minimal function, or 
some divisor, possibly coupled with a method for obtaining any proper 
vector in a finite number of steps, given the associated proper value. The 
method to be used in evaluating the zeros of the function is left open. 

Naturally one such method would be direct expansion of the determinant
$|A - \lambda I|$ to obtain the characteristic function. This done, and the
equation solved, one could proceed to solve the several sets of homogeneous
equations $(A - \lambda I)x = 0$, where $\lambda$ takes on each of the proper values.
Such a naive method might be satisfactory for simple matrices of order 2
or 3, but for larger matrices the labor would quickly become astronomical.

In discussing iterative methods, it was convenient to consider sepa- 
rately Hermitian and non-Hermitian matrices. This was primarily 
because of the fact that for Hermitian matrices the proper values are 
known to be real, though a further point in favor of the Hermitian matrix 
is the fact that it can always be diagonalized. For the application of 
direct methods, however, the occurrence of complex proper values intro- 
duces no difficulty in principle, though naturally they complicate the 
task of solving the equation once it is obtained. Intrinsic difficulties 
arise in the use of direct methods only with the occurrence of multiple 
proper values. Fortunately, however, one begins in the same way 
regardless of whether multiplicities are present or not. If present, the 
fact reveals itself as one proceeds. If a given direct method applies at 
all to non-Hermitian matrices, then its application to any diagonalizable 
matrix can be discussed about as easily as can the application to the 
special case of a Hermitian matrix. Consequently, each method or
class of methods will be described for whatever cases it may cover, 
Hermitian or not, before passing on to another. 

One ingenious and simple method that has been used to find the
characteristic equation may be mentioned at the outset. This is to
evaluate the determinant $\phi(\lambda) = |A - \lambda I|$ for each of $n + 1$ selected
values of $\lambda$, and then write the interpolation polynomial of degree $n$
which they determine. This might be advantageous when the matrix is
small, and values of $\lambda$ could be found for which the evaluation is especially
simple. However, it provides no assistance for the computation of the
proper vectors.

4.21. Symmetric Functions of the Roots. If we write

(4.21.1) $f(\lambda) \equiv (-1)^n\phi(\lambda) = \lambda^n - \gamma_1\lambda^{n-1} + \gamma_2\lambda^{n-2} - \cdots + (-1)^n\gamma_n$,

then the coefficients $\gamma_h$ are the "elementary" symmetric polynomials in
the proper values. That is to say, $\gamma_h$ is the sum of the products $h$ at a
time of the $\lambda_i$. In particular

(4.21.2) $\gamma_1 = \sum\lambda_i = \operatorname{tr}(A)$.

Newton's identities (3.02.5) express the sums of the powers $s_h$ as polynomials
in these elementary polynomials, where the $\gamma_h$ here take the
places of the $\sigma_h$ in (3.02.5). But

$s_1 = \gamma_1 = \operatorname{tr}(A)$,
$s_2 = \sum\lambda_i^2 = \operatorname{tr}(A^2)$,

and in general

(4.21.3) $s_h = \sum\lambda_i^h = \operatorname{tr}(A^h)$.

Hence by taking powers of $A$ up to and including the $n$th, one can compute
the sums of powers of the $\lambda_i$, and thence by applying Newton's
identities find the $\gamma_h$. Hence to find the $s_h$ by this method requires
$(n - 1)$ matrix products; each matrix product requires $n^3$ multiplications;
hence altogether to find the coefficients in (4.21.1) requires approximately
$n^4$ multiplications.
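The route from traces of powers to the coefficients can be sketched as follows. Since (3.02.5) is not reproduced here, the code assumes Newton's identities in the standard form $h\gamma_h = \sum_{k=1}^{h} (-1)^{k-1}\gamma_{h-k}s_k$ with $\gamma_0 = 1$; exact rational arithmetic is used so that the result can be checked without round-off. The function names are illustrative.

```python
from fractions import Fraction


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def power_sums(A):
    """s_h = tr(A^h) for h = 1, ..., n, using n - 1 matrix products."""
    n = len(A)
    P, s = A, []
    for h in range(1, n + 1):
        s.append(sum(P[i][i] for i in range(n)))
        if h < n:
            P = matmul(P, A)
    return s


def elementary_from_power_sums(s):
    """Newton's identities: h*g_h = sum_{k=1..h} (-1)^(k-1) g_{h-k} s_k."""
    g = [Fraction(1)]
    for h in range(1, len(s) + 1):
        total = sum((-1) ** (k - 1) * g[h - k] * s[k - 1]
                    for k in range(1, h + 1))
        g.append(total / h)
    return g[1:]


A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
s = power_sums(A)                       # [6, 16, 48]
gammas = elementary_from_power_sums(s)  # coefficients gamma_1, gamma_2, gamma_3
```

For this matrix the characteristic function is $\lambda^3 - 6\lambda^2 + 10\lambda - 4$, so the coefficients returned are 6, 10, and 4.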

To improve the algorithm and obtain further information, consider

(4.21.4) $C(\lambda) = \operatorname{adj}(\lambda I - A) = C_0\lambda^{n-1} - C_1\lambda^{n-2} + C_2\lambda^{n-3} - \cdots + (-1)^{n-1}C_{n-1}$.

Then

(4.21.5) $C(\lambda)(\lambda I - A) = (-1)^n\phi(\lambda)I = f(\lambda)I$.

On expanding and comparing coefficients of $\lambda$ on the two sides of this
equation, one finds

(4.21.6)
$C_0 = I$,
$C_1 + C_0 A = \gamma_1 I$,
$C_2 + C_1 A = \gamma_2 I$,
. . . ,
$C_{n-1} + C_{n-2}A = \gamma_{n-1}I$,
$C_{n-1}A = \gamma_n I$.

Now $\gamma_1$ is given by (4.21.2). Hence $C_1$ can be found from the second of
(4.21.6). On multiplying this by $A$ and taking the trace of both sides,
one finds in view of (4.21.3)

$\operatorname{tr}(C_1 A) + s_2 = \gamma_1 s_1$.

Comparison with the second of (3.02.5) shows that

$2\gamma_2 = \operatorname{tr}(C_1 A)$.

Hence $\gamma_2$ can be found and therefore, by the third equation, $C_2$. In
general

(4.21.7) $C_h = \gamma_h I - \gamma_{h-1}A + \gamma_{h-2}A^2 - \cdots + (-1)^h A^h$.

On multiplying by $A$, taking the trace, and comparing with (3.02.5), one
finds that

(4.21.8) $h\gamma_h = \operatorname{tr}(C_{h-1}A)$.

Hence the coefficients $\gamma_h$ and the matrices $C_h$ can be obtained in the
sequence $C_0 = I, \gamma_1, C_1, \gamma_2, C_2, \ldots, \gamma_n$. The final equation in (4.21.6)
serves as a check. Note that, since each $C_h$ is a polynomial in $A$, it is
commutative with $A$:

$AC_h = C_h A$.

As a byproduct of this computation, one obtains the determinant

$|A| = \gamma_n$,

the adjoint

$\operatorname{adj} A = C_{n-1}$,

and the inverse

$A^{-1} = C_{n-1}/\gamma_n$.
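The sequence $C_0 = I, \gamma_1, C_1, \gamma_2, \ldots, \gamma_n$ can be generated directly from the recurrences $h\gamma_h = \operatorname{tr}(C_{h-1}A)$ and $C_h = \gamma_h I - C_{h-1}A$. The sketch below uses exact rational arithmetic, which is a choice made here for checking purposes and not something the text prescribes; the name `coefficient_sequence` is illustrative.

```python
from fractions import Fraction


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def coefficient_sequence(A):
    """Return ([gamma_1..gamma_n], [C_0..C_{n-1}]) following (4.21.6)-(4.21.8)."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    I = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    C, gammas, Cs = I, [], [I]
    for h in range(1, n + 1):
        CA = matmul(C, A)
        g = sum(CA[i][i] for i in range(n)) / h     # h*gamma_h = tr(C_{h-1} A)
        gammas.append(g)
        if h < n:
            # C_h = gamma_h I - C_{h-1} A
            C = [[g * I[i][j] - CA[i][j] for j in range(n)] for i in range(n)]
            Cs.append(C)
    return gammas, Cs


A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
gammas, Cs = coefficient_sequence(A)
adjoint = Cs[-1]                                    # C_{n-1} = adj A
inverse = [[c / gammas[-1] for c in row] for row in adjoint]
check = matmul(adjoint, A)                          # should equal gamma_n * I
```

The last line reproduces the check mentioned in the text: the product $C_{n-1}A$ must equal $\gamma_n I$.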

If $C(\lambda_i) \ne 0$, then any non-null column of $C(\lambda_i)$ is a proper column
vector, and any non-null row of $C(\lambda_i)$ is a proper row vector associated
with $\lambda_i$, since by (4.21.5) and the commutativity of the $C_h$ and $A$ we have

$C(\lambda_i)(\lambda_i I - A) = (\lambda_i I - A)C(\lambda_i) = f(\lambda_i)I = 0$.

But if $\lambda_i$ is a simple root, then necessarily $C(\lambda_i) \ne 0$, and there is at
least one non-null column and at least one non-null row. For suppose
$C(\lambda_i) = 0$. By differentiating (4.21.5) and setting $\lambda = \lambda_i$, one obtains

$C'(\lambda_i)(\lambda_i I - A) = (-1)^n\phi'(\lambda_i)I$,

and on taking determinants of both sides,

$|C'(\lambda_i)| \cdot (-1)^n\phi(\lambda_i) = [(-1)^n\phi'(\lambda_i)]^n$.

But $\phi(\lambda_i) = 0$, whereas $\phi'(\lambda_i) \ne 0$, and we have therefore reached a
contradiction. Hence the method yields at least those proper row and
column vectors that are associated with simple proper values.

In forming the matrices $C_2, C_3, \ldots, C_{n-1}$, each requires $n^3$ multiplications,
making $n^3(n - 2)$ in all for forming the characteristic function.
Given a $\lambda_i$, to form $C(\lambda_i)$ one can form

$C_0\lambda_i, \quad (C_0\lambda_i - C_1)\lambda_i, \quad (C_0\lambda_i^2 - C_1\lambda_i + C_2)\lambda_i, \quad \ldots$,

which requires $n^2(n - 2)$ multiplications for each $\lambda_i$ and hence again
$n^3(n - 2)$ multiplications when the $\lambda_i$ are all distinct. Altogether this is
$2n^3(n - 2)$ multiplications. However, if one forms only a single row
and a single column, and these are non-null, this can be reduced to a total
of $n^2(n^2 - 4)$.

Suppose next that $\lambda_i$ is a double root, but not a triple root. It may
still happen that $C(\lambda_i) \ne 0$. If so, then again any non-null column is a
proper column vector, and any non-null row a proper row vector. Since
by differentiation of (4.21.5)

(4.21.9) $C(\lambda) + (\lambda I - A)C'(\lambda) = (-1)^n\phi'(\lambda)I$,

and since $\lambda_i$ is a double root, therefore

$C(\lambda_i) + (\lambda_i I - A)C'(\lambda_i) = 0$,

and certainly therefore $C'(\lambda_i) \ne 0$. However,

$(\lambda I - A)C(\lambda) + (\lambda I - A)^2C'(\lambda) = (-1)^n\phi'(\lambda)(\lambda I - A)$,

and therefore

(4.21.10) $(\lambda_i I - A)^2C'(\lambda_i) = 0$.

Hence there is at least one non-null column $x$ and at least one non-null
row $u$ of $C'(\lambda_i)$, and

$u(\lambda_i I - A)^2 = 0, \qquad (\lambda_i I - A)^2x = 0$,

whereas

$u(\lambda_i I - A) \ne 0, \qquad (\lambda_i I - A)x \ne 0$.

Thus $u$ and $x$ are principal vectors of grade 2 associated with $\lambda_i$.

Suppose, on the other hand, that $C(\lambda_i) = 0$. Then by (4.21.9) it
follows that

(4.21.11) $(\lambda_i I - A)C'(\lambda_i) = 0$,

since $\lambda_i$ is a double root. Then any non-null column of $C'(\lambda_i)$, if there is
such, is a proper column vector, and any non-null row a proper row vector.
We can show that $C'(\lambda_i)$ has rank 2.

We know from Chap. 2 that a double root has associated at most two
proper vectors. Hence $(\lambda_i I - A)$ has rank $n - 2$ at least. But
$0 = C(\lambda_i) = \operatorname{adj}(\lambda_i I - A)$ so that $(\lambda_i I - A)$ has rank $n - 2$ at most.
Hence in this case $(\lambda_i I - A)$ has rank $n - 2$ exactly. Hence by (4.21.11)
$C'(\lambda_i)$ has rank 2 at most. Let $B$ be a constant matrix of maximal rank
such that

$C'(\lambda_i)B = 0$.

Since $C'(\lambda_i)$ has rank 2 at most, $B$ has rank $n - 2$ at least. If we differentiate
(4.21.9), set $\lambda = \lambda_i$, and multiply by $B$, we obtain

$(\lambda_i I - A)C''(\lambda_i)B = (-1)^n\phi''(\lambda_i)B$.

The rank of the right member of this equation is the same as the rank of
$B$, since $\phi''(\lambda_i) \ne 0$, and this cannot exceed the rank of any matrix factor
on the left. But $(\lambda_i I - A)$ has rank $n - 2$, whence the rank of $B$ cannot
exceed $n - 2$. Hence $B$ has rank $n - 2$ exactly, and therefore $C'(\lambda_i)$ has
rank 2 exactly.

Thus when $\lambda_i$ is a double root (and not a triple root), either $C(\lambda_i) \ne 0$,
in which case there exists a non-null column of $C(\lambda_i)$ and a non-null
column of $C'(\lambda_i)$, the first being a proper, and the second a principal,
column vector associated with $\lambda_i$; or else $C(\lambda_i) = 0$, in which case $C'(\lambda_i)$
has rank 2 and any two linearly independent columns are proper vectors.
Corresponding statements can be made for the rows.

The argument can be extended to the case of a root of arbitrary 
multiplicity. 

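4.22. Methods of Enlargement. Suppose $A_{n-1}$ is a principal minor of
$A$, say that taken from the first $n - 1$ rows and columns of $A$, and let

(4.22.1) $A = A_n = \begin{pmatrix} A_{n-1} & a_{n-1} \\ a'_{n-1} & \alpha_{nn} \end{pmatrix}$.

Then

(4.22.2) $\lambda I_n - A_n = \begin{pmatrix} \lambda I_{n-1} - A_{n-1} & -a_{n-1} \\ -a'_{n-1} & \lambda - \alpha_{nn} \end{pmatrix}$.

Let

(4.22.3) $\phi_n(\lambda) = |\lambda I_n - A_n|$,

and

(4.22.4) $B_n(\lambda) = \operatorname{adj}(\lambda I_n - A_n)$, with last column $\begin{pmatrix} f_{n-1}(\lambda) \\ \phi_{n-1}(\lambda) \end{pmatrix}$ and last row $(f'_{n-1}(\lambda), \ \phi_{n-1}(\lambda))$.

Here the prime does not denote differentiation, but is merely a distinguishing
mark. Note in particular that the element in the lower right-hand
corner of $B_n(\lambda)$ is the cofactor $\phi_{n-1}(\lambda) = |\lambda I_{n-1} - A_{n-1}|$.

Now $\phi_{n-1}(\lambda)$ is of degree $n - 1$ in $\lambda$. However, each element of
$f_{n-1}(\lambda)$ and each element of $f'_{n-1}(\lambda)$ is of degree $n - 2$ at most. Hence if
we note that

$(\lambda I_n - A_n)B_n(\lambda) = \phi_n(\lambda)I_n$,

or in more detail, from the last column,

(4.22.5) $(\lambda I_{n-1} - A_{n-1})f_{n-1}(\lambda) - a_{n-1}\phi_{n-1}(\lambda) = 0$,
$-a'_{n-1}f_{n-1}(\lambda) + (\lambda - \alpha_{nn})\phi_{n-1}(\lambda) = \phi_n(\lambda)$,

it follows that, when $\phi_{n-1}(\lambda)$ is known, all coefficient vectors of $f_{n-1}(\lambda)$
can be obtained by comparing coefficients, and hence $\phi_n(\lambda)$ can be formed
from the last equation. In fact

$\lambda f_{n-1}(\lambda) = A_{n-1}f_{n-1}(\lambda) + a_{n-1}\phi_{n-1}(\lambda)$,

so that beginning with the coefficient of $\lambda^{n-2}$ in $f_{n-1}(\lambda)$, which is simply
$a_{n-1}$, the vector coefficients can be obtained in sequence.

If one first forms $\phi_1(\lambda) = \lambda - \alpha_{11}$, one can then by this scheme form
$f_1(\lambda)$ and hence $\phi_2(\lambda)$; then $f_2(\lambda)$, and hence $\phi_3(\lambda)$, . . . , eventually
obtaining $\phi_n(\lambda)$.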

Given $\phi_{n-1}(\lambda)$, the product $A_{n-1}f_{n-1}(\lambda)$ requires $(n - 1)^3$ multiplications;
$a_{n-1}\phi_{n-1}(\lambda)$ requires $(n - 1)^2$; $a'_{n-1}f_{n-1}(\lambda)$ requires $(n - 1)^2$; and
$\alpha_{nn}\phi_{n-1}(\lambda)$ requires $n - 1$. Altogether this is $n^2(n - 1)$. When they
are summed over all values from 2 to $n$, we have a total of

$n(n^2 - 1)(3n + 2)/12$

multiplications, or approximately $n^4/4$. The advantage over the other
method lies in the fact that not the entire adjoint but only one column
of it has been computed.
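The enlargement scheme lends itself to a short sketch. The code below is an illustration under assumptions of notation made here: polynomials are stored as coefficient lists with the highest power first, so that $\phi_k(\lambda) = \sum_j p_j\lambda^{k-j}$ with $p_0 = 1$, and the vector coefficients of $f_k(\lambda)$ follow from comparing coefficients in $\lambda f_k = A_k f_k + a_k\phi_k$, which gives $f_0 = a_k$ and $f_j = A_k f_{j-1} + a_k p_j$. The function name `char_poly_by_enlargement` is illustrative.

```python
def char_poly_by_enlargement(A):
    """Coefficients of |lambda I - A|, highest power first."""
    n = len(A)
    phi = [1, -A[0][0]]                       # phi_1 = lambda - alpha_11
    for k in range(1, n):
        Ak = [row[:k] for row in A[:k]]       # leading minor of order k
        a_col = [A[i][k] for i in range(k)]   # bordering column
        a_row = A[k][:k]                      # bordering row
        alpha = A[k][k]                       # new diagonal element

        # Vector coefficients of f_k: f_j = A_k f_{j-1} + a_col * p_j.
        f, prev = [], [0] * k
        for j in range(k):
            prev = [sum(Ak[i][t] * prev[t] for t in range(k)) + a_col[i] * phi[j]
                    for i in range(k)]
            f.append(prev)

        # phi_{k+1} = (lambda - alpha) phi_k - a_row . f_k
        new = phi + [0]                       # lambda * phi_k
        for j in range(k + 1):
            new[j + 1] -= alpha * phi[j]
        for j in range(k):
            new[j + 2] -= sum(a_row[i] * f[j][i] for i in range(k))
        phi = new
    return phi


symmetric = char_poly_by_enlargement([[2, 1, 0], [1, 2, 1], [0, 1, 2]])
general = char_poly_by_enlargement([[1, 2], [3, 4]])
```

The first call returns the coefficients of $\lambda^3 - 6\lambda^2 + 10\lambda - 4$ and the second those of $\lambda^2 - 5\lambda - 2$; the bordering row and column are kept separate, so the matrix need not be symmetric.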

However, this also accounts for a major disadvantage, when proper 
vectors as well as proper values are needed. If for any X, the vector 
/n-i(X,-) and the scalar </> n -i(X) do not both vanish, then a proper vector 
associated with X t - is 



\ 

y 



Undoubtedly this covers the majority of cases that arise in actual practice. 
But this column alone is insufficient for obtaining all the proper and 
principal vectors when X t - is a multiple root, and moreover no general 
method has been provided for the theoretically possible case of a simple 
root X< for which this column (but not the entire adjoint) vanishes, 




The proper row vectors can be obtained in the "usual" case by using
equations corresponding to (4.22.5) to compute $f'_{n-1}(\lambda)$, and hence



The "escalator method" also proceeds to matrices of progressively
higher order, but it requires the actual solution of the characteristic
equation at each stage. Consider only the case of a symmetric matrix, for
which moreover all proper values are distinct. Thus suppose for the
symmetric matrix $A$ all proper values and all proper vectors are known:

$$AU = U\Lambda. \tag{4.22.6}$$

If $A$ is bordered by a column vector and its transpose to form a symmetric
matrix of next higher order, we wish to solve the system

$$\begin{pmatrix} A & a \\ a^T & \alpha \end{pmatrix}\begin{pmatrix} y \\ \eta \end{pmatrix} = \lambda\begin{pmatrix} y \\ \eta \end{pmatrix}. \tag{4.22.7}$$

Hence

$$Ay + a\eta = \lambda y, \qquad a^Ty + \alpha\eta = \lambda\eta. \tag{4.22.8}$$

Let

$$y = Uw. \tag{4.22.9}$$

Then

$$\begin{aligned}
AUw + a\eta &= \lambda Uw,\\
U\Lambda w + a\eta &= \lambda Uw,\\
\Lambda w + U^Ta\eta &= \lambda w,\\
w &= (\lambda I - \Lambda)^{-1}U^Ta\eta.
\end{aligned} \tag{4.22.10}$$

Since $\eta$ can be taken as an arbitrary scale factor, (4.22.9) and (4.22.10)
determine the proper vector once the proper value $\lambda$ is known. However,
from the second equation (4.22.8) it follows that

$$a^TUw = (\lambda - \alpha)\eta,$$

and, on substituting (4.22.10),

$$a^TU(\lambda I - \Lambda)^{-1}U^Ta = \lambda - \alpha. \tag{4.22.11}$$

This equation in scalar form can be written

$$\sum_i (a^Tu_i)^2/(\lambda - \lambda_i) = \lambda - \alpha, \tag{4.22.12}$$

and its roots are the proper values of the bordered matrix. If the $\lambda_i$ are
arranged on the $\lambda$ axis in the order $\lambda_1 > \lambda_2 > \cdots > \lambda_n$, then as $\lambda$
varies from $-\infty$ to $\lambda_n$, the left member decreases from 0 to $-\infty$; as
$\lambda$ varies from $\lambda_n$ to $\lambda_{n-1}$, the left member decreases from $+\infty$ to $-\infty$;
. . . ; as $\lambda$ varies from $\lambda_1$ to $+\infty$, the left member decreases from $+\infty$ to
0. The right member increases linearly throughout. Hence (4.22.12)
has exactly one root $\lambda < \lambda_n$; exactly one root $\lambda$ between each pair of
consecutive $\lambda_i$; and exactly one root $\lambda > \lambda_1$. With the roots thus
isolated, Newton's method is readily applied for evaluating them.
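The isolation of the roots of (4.22.12) can be illustrated by a short Python sketch. It is offered only as an illustration under stated assumptions: every product $a^Tu_i$ is assumed different from zero, and simple bisection on each isolating interval is substituted for Newton's method, since bisection cannot leave the interval. The names `escalator_roots`, `lams`, `coeffs`, and `alpha` are introduced here and are hypothetical.

```python
import math


def escalator_roots(lams, coeffs, alpha):
    """Roots of sum(c_i**2 / (lam - lam_i)) = lam - alpha, as in (4.22.12).

    lams   -- the distinct proper values lam_i of A
    coeffs -- the numbers c_i = a^T u_i (assumption: none vanishes)
    alpha  -- the new diagonal element of the bordered matrix
    Returns the n + 1 roots in increasing order.
    """
    def g(lam):
        # Right member minus left member; this increases on every
        # interval between consecutive poles lam_i.
        return lam - alpha - sum(c * c / (lam - l) for c, l in zip(coeffs, lams))

    def bisect(lo, hi):
        # g is negative just to the right of lo, positive just left of hi.
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:
                break
            if g(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    poles = sorted(lams)
    # No proper value of the bordered matrix exceeds this bound in magnitude.
    radius = (max(max(abs(l) for l in lams), abs(alpha))
              + math.sqrt(sum(c * c for c in coeffs)) + 1.0)
    ends = [-radius] + poles + [radius]
    return [bisect(lo, hi) for lo, hi in zip(ends, ends[1:])]
```

For the matrix with rows (2, 1) and (1, 2), whose proper values are 3 and 1, bordered by $a = (1, 0)^T$ and $\alpha = 0$, one has $a^Tu_1 = a^Tu_2 = 1/\sqrt{2}$, and the call `escalator_roots([3.0, 1.0], [2 ** -0.5, 2 ** -0.5], 0.0)` returns three roots separated by 1 and 3.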

4.23. Finite Iterations. If $b_0$ is an arbitrary non-null vector, then in
the sequence

$$b_0, \quad b_1 = Ab_0, \quad b_2 = Ab_1, \quad \ldots,$$

at most $n$ of the vectors are linearly independent. Suppose the first
$m \le n$ of the vectors are linearly independent, but the first $m+1$ are
linearly dependent. Then $b_m$ is expressible as a linear combination of the
other $m$, and hence for some scalars $\beta_i$ it is true that

$$b_m - \beta_1b_{m-1} + \beta_2b_{m-2} - \cdots + (-1)^m\beta_mb_0 = 0. \tag{4.23.1}$$

Hence

$$(A^m - \beta_1A^{m-1} + \beta_2A^{m-2} - \cdots + (-1)^m\beta_mI)b_0 = 0, \tag{4.23.2}$$

which is to say that

$$p(A)b_0 = 0, \tag{4.23.3}$$

where

$$p(\lambda) = \lambda^m - \beta_1\lambda^{m-1} + \cdots + (-1)^m\beta_m. \tag{4.23.4}$$

If

$$d(\lambda) = \lambda^\nu + \delta_1\lambda^{\nu-1} + \cdots + \delta_\nu$$

is the highest common divisor of $p(\lambda)$ and $\psi(\lambda)$, then $d(A)b_0 = 0$. That
is to say

$$(A^\nu + \delta_1A^{\nu-1} + \cdots + \delta_\nu I)b_0 = 0,$$

or

$$b_\nu + \delta_1b_{\nu-1} + \cdots + \delta_\nu b_0 = 0.$$

But the vectors $b_0$, $b_1$, . . . , $b_{m-1}$ are linearly independent, whence $\nu = m$
and $d = p$. Hence $p(\lambda)$ divides the minimal function $\psi(\lambda)$ and hence
also the characteristic function $\phi(\lambda)$. In particular if $m = n$, then
$p(\lambda) = (-1)^n\phi(\lambda)$. When this is true, therefore, one can form the
characteristic equation by first performing the $n$ iterations $Ab_i$ and then
by solving a system of $n$ linear equations. When this is not so, one can
at least obtain a divisor of the characteristic function by performing a
smaller number of iterations and by solving a system of lower order.
However, in addition one must test at each step the independence of the
vectors already found, as long as the number is below $n + 1$.
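The procedure for the case $m = n$ can be sketched in Python as follows. The sketch assumes that the first $n$ iterates are linearly independent (otherwise the elimination finds no pivot and stops with an error), and it works in exact rational arithmetic so that no question of round-off enters. The function names are illustrative only. It returns the coefficients $g_j$ in $b_n = g_0b_0 + \cdots + g_{n-1}b_{n-1}$, so that $p(\lambda) = \lambda^n - g_{n-1}\lambda^{n-1} - \cdots - g_0$.

```python
from fractions import Fraction


def mat_vec(A, v):
    """Product of the matrix A (list of rows) with the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def solve(M, rhs):
    """Gauss-Jordan elimination in exact arithmetic; M must be nonsingular."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def iterate_and_solve(A, b0):
    """Return [g_0, ..., g_{n-1}] with b_n = g_0 b_0 + ... + g_{n-1} b_{n-1}."""
    n = len(A)
    iterates = [list(b0)]
    for _ in range(n):
        iterates.append(mat_vec(A, iterates[-1]))
    # Column j of the system is the iterate b_j; the right member is b_n.
    M = [[iterates[j][i] for j in range(n)] for i in range(n)]
    return solve(M, iterates[n])
```

With the rows (2, 1) and (1, 2) and $b_0 = (1, 0)^T$ the iterates are $(2, 1)^T$ and $(5, 4)^T$, and the sketch gives $g_0 = -3$, $g_1 = 4$, that is, $p(\lambda) = \lambda^2 - 4\lambda + 3$.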

A great improvement is provided by Lanczos's "method of minimized
iterations." Let $b_0$ and $c_0$ be arbitrary nonorthogonal non-null vectors.
In case $A$ is symmetric, take $b_0 = c_0$; if $A$ is Hermitian, take $b_0 = c_0$ and
in the following discussion replace the transpose by the conjugate
transpose. In other cases $b_0$ and $c_0$ may or may not be the same. Form $b_1$
as a linear combination of $b_0$ and $Ab_0$, orthogonal to $c_0$; and $c_1$ as a linear
combination of $c_0$ and $A^Tc_0$, orthogonal to $b_0$. Thus



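$$b_1 = Ab_0 - \alpha_0b_0,$$

where

$$0 = c_0^Tb_1 = c_0^TAb_0 - \alpha_0c_0^Tb_0;$$

and

$$c_1 = A^Tc_0 - \delta_0c_0,$$

where

$$0 = c_1^Tb_0 = c_0^TAb_0 - \delta_0c_0^Tb_0.$$

But then

$$\alpha_0 = \delta_0 = c_0^TAb_0/c_0^Tb_0.$$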

Next, choose $b_2$ as a linear combination of $b_0$, $b_1$, and $Ab_1$, orthogonal to
both $c_0$ and $c_1$; and $c_2$ as a linear combination of $c_0$, $c_1$, and $A^Tc_1$, orthogonal
to both $b_0$ and $b_1$. Then



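$$b_2 = Ab_1 - \alpha_1b_1 - \beta_0b_0,$$

where

$$0 = c_0^Tb_2 = c_0^TAb_1 - \beta_0c_0^Tb_0, \qquad 0 = c_1^Tb_2 = c_1^TAb_1 - \alpha_1c_1^Tb_1.$$

Hence

$$\alpha_1 = c_1^TAb_1/c_1^Tb_1, \qquad \beta_0 = c_0^TAb_1/c_0^Tb_0.$$

But from the relations already derived,

$$c_0^TAb_1 = (c_1^T + \alpha_0c_0^T)b_1 = c_1^Tb_1.$$

Hence

$$\beta_0 = c_1^Tb_1/c_0^Tb_0,$$

and if

$$c_2^T = c_1^TA - \alpha_1c_1^T - \beta_0c_0^T,$$

then it follows that

$$0 = c_2^Tb_0 = c_2^Tb_1.$$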



The step breaks down in case $c_1^Tb_1 = 0$, since this is the denominator in
$\alpha_1$. But this means that

$$0 = c_1^Tb_1 = c_0^TAb_1 = c_0^TA(A - \alpha_0I)b_0,$$

and when $\alpha_0$ is replaced by its value, this is, apart from the factor $(c_0^Tb_0)^{-1}$,
equal to the determinant of the matrix product

$$\begin{pmatrix} c_0^T \\ c_0^TA \end{pmatrix}\begin{pmatrix} b_0 & Ab_0 \end{pmatrix}.$$

Hence this can vanish only if the pair $b_0$, $Ab_0$ or else the pair $c_0$, $A^Tc_0$ is
linearly dependent. Suppose for the present that this is not the case.

We now show that $\alpha_2$ and $\beta_1$ can be chosen so that

$$b_3 = Ab_2 - \alpha_2b_2 - \beta_1b_1$$

is orthogonal to all three vectors $c_0$, $c_1$, and $c_2$ and so that

$$c_3 = A^Tc_2 - \alpha_2c_2 - \beta_1c_1$$

is orthogonal to $b_0$, $b_1$, and $b_2$, provided only that the vectors $b_0$, $Ab_0$,
$A^2b_0$, and also the vectors $c_0$, $A^Tc_0$, $(A^T)^2c_0$, are linearly independent triples.
First we note that

$$c_0^TAb_2 = (c_1^T + \alpha_0c_0^T)b_2 = 0,$$

$$c_2^TAb_0 = c_2^T(b_1 + \alpha_0b_0) = 0,$$

so that $c_3$ is orthogonal to $b_0$, and $b_3$ to $c_0$, independently of $\alpha_2$ and $\beta_1$.
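Next

$$c_1^TAb_2 = (c_2^T + \alpha_1c_1^T + \beta_0c_0^T)b_2 = c_2^Tb_2.$$

Hence

$$c_1^Tb_3 = c_2^Tb_2 - \beta_1c_1^Tb_1,$$

whence $c_1$ and $b_3$ are orthogonal if

$$\beta_1 = c_2^Tb_2/c_1^Tb_1.$$

But in this event

$$c_3^Tb_1 = c_2^TAb_1 - \beta_1c_1^Tb_1 = c_2^Tb_2 - \beta_1c_1^Tb_1 = 0.$$

Finally,

$$c_2^Tb_3 = c_3^Tb_2 = c_2^TAb_2 - \alpha_2c_2^Tb_2,$$

and both vanish provided

$$\alpha_2 = c_2^TAb_2/c_2^Tb_2.$$

This is always possible if $c_2$ and $b_2$ are not orthogonal. But the matrix
product

$$\begin{pmatrix} c_0^T \\ c_0^TA \\ c_0^TA^2 \end{pmatrix}\begin{pmatrix} b_0 & Ab_0 & A^2b_0 \end{pmatrix}$$

has a determinant which by successive reductions of rows and columns
yields

$$(c_0^Tb_0)(c_1^Tb_1)(c_2^Tb_2).$$

Since we have assumed $(c_0^Tb_0)(c_1^Tb_1) \ne 0$, it follows that $c_2^Tb_2 = 0$ if and
only if either the triple $c_0$, $A^Tc_0$, $(A^T)^2c_0$ or else the triple $b_0$, $Ab_0$, $A^2b_0$ is
linearly dependent.

We can proceed inductively, forming

$$b_{i+1} = Ab_i - \alpha_ib_i - \beta_{i-1}b_{i-1}, \qquad c_{i+1} = A^Tc_i - \alpha_ic_i - \beta_{i-1}c_{i-1}, \tag{4.23.5}$$

where

$$\alpha_i = c_i^TAb_i/c_i^Tb_i, \qquad \beta_{i-1} = c_i^Tb_i/c_{i-1}^Tb_{i-1}, \tag{4.23.6}$$

until possibly at some stage $c_i$ and $b_i$ are orthogonal. When $c_i$ and $b_i$ are
not orthogonal, then $b_{i+1}$ is orthogonal to every vector $c_0$, $c_1$, . . . , $c_i$,
and $c_{i+1}$ is orthogonal to every vector $b_0$, $b_1$, . . . , $b_i$.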

Necessarily, there is a smallest $m \le n$ for which $c_m^Tb_m = 0$, since if
the vectors $c_0$, $c_1$, . . . , $c_{n-1}$ are all linearly independent, then only
$b_n = 0$ is orthogonal to them all, and this vector is orthogonal to $c_n$,
whatever $c_n$ may be. Suppose the relation holds for some $m < n$. Then
either the set $b_0$, $Ab_0$, . . . , $A^mb_0$ or else the set $c_0$, $A^Tc_0$, . . . , $(A^T)^mc_0$
is linearly dependent. For definiteness suppose it to be the former. The
set from which $A^mb_0$ is omitted is linearly independent, for if it were not,
the $m$ selected would not be the smallest possible. Hence $A^mb_0$ is
expressible as a linear combination of the $m$ vectors $b_0$, . . . , $A^{m-1}b_0$.
Hence $Ab_{m-1}$ is some linear combination of the vectors $b_0$, $b_1$, . . . , $b_{m-1}$:

$$Ab_{m-1} = \mu_{m-1}b_{m-1} + \mu_{m-2}b_{m-2} + \cdots + \mu_0b_0.$$

Hence

$$b_m = (\mu_{m-1} - \alpha_{m-1})b_{m-1} + (\mu_{m-2} - \beta_{m-2})b_{m-2} + \mu_{m-3}b_{m-3} + \cdots + \mu_0b_0.$$

But then

$$\begin{aligned}
0 &= c_0^Tb_m = \mu_0c_0^Tb_0,\\
0 &= c_1^Tb_m = \mu_1c_1^Tb_1,\\
&\;\;\vdots\\
0 &= c_{m-1}^Tb_m = (\mu_{m-1} - \alpha_{m-1})c_{m-1}^Tb_{m-1}.
\end{aligned}$$

By hypothesis all the vector products on the right are non-null, and
therefore

$$0 = \mu_0 = \mu_1 = \cdots = \mu_{m-3} = \mu_{m-2} - \beta_{m-2} = \mu_{m-1} - \alpha_{m-1},$$

whence $b_m = 0$ and

$$0 = (A - \alpha_{m-1}I)b_{m-1} - \beta_{m-2}b_{m-2}. \tag{4.23.7}$$

Thus if the vectors $b_0$, $Ab_0$, . . . , $A^mb_0$ are linearly dependent, then
(4.23.7) holds; correspondingly if the iterates of $c_0$ by $A^T$ are linearly
dependent, then it follows that

$$0 = (A^T - \alpha_{m-1}I)c_{m-1} - \beta_{m-2}c_{m-2}. \tag{4.23.8}$$

Hence the recursion (4.23.5) and (4.23.6) can be continued until for some
$i = m$ either (4.23.7) or else (4.23.8) holds.

Consider now the sequence of polynomials

$$\begin{aligned}
p_0(\lambda) &= 1,\\
p_1(\lambda) &= (\lambda - \alpha_0)p_0(\lambda),\\
p_2(\lambda) &= (\lambda - \alpha_1)p_1(\lambda) - \beta_0p_0(\lambda),\\
&\;\;\vdots\\
p_{i+1}(\lambda) &= (\lambda - \alpha_i)p_i(\lambda) - \beta_{i-1}p_{i-1}(\lambda),
\end{aligned} \tag{4.23.9}$$

where the $\alpha$'s and $\beta$'s are defined by (4.23.6). One verifies inductively
that

$$p_i(A)b_0 = b_i, \qquad p_i(A^T)c_0 = c_i. \tag{4.23.10}$$

Hence either $p_m(A)b_0 = 0$, or else $p_m(A^T)c_0 = 0$. In either case $p_m(\lambda)$ is
a divisor of the minimal function $\psi(\lambda)$, its coefficients are provided
without the necessity for solving a system of equations, and moreover the test
for dependence of the successive sets of iterates of $b_0$ and of $c_0$ is performed
automatically in the course of the computation.
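The recursion (4.23.5), (4.23.6), together with the polynomials (4.23.9), can be sketched in Python. The sketch uses exact rational arithmetic, so that the test $c_i^Tb_i = 0$ is meaningful; in floating arithmetic some tolerance would have to be supplied, and that question is not addressed here. The function names are introduced for illustration only.

```python
from fractions import Fraction


def mat_vec(M, v):
    """Product of the matrix M (list of rows) with the vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def minimized_iterations(A, b0, c0):
    """Apply (4.23.5)-(4.23.6) until c_m^T b_m = 0.

    Returns (alphas, betas, p), where p lists the coefficients of p_m,
    constant term first.
    """
    At = [list(col) for col in zip(*A)]
    n = len(A)
    b_prev, c_prev = [Fraction(0)] * n, [Fraction(0)] * n
    b = [Fraction(x) for x in b0]
    c = [Fraction(x) for x in c0]
    p_prev, p = [Fraction(0)], [Fraction(1)]      # p_{-1} = 0, p_0 = 1
    alphas, betas = [], []
    delta_prev = None
    while True:
        delta = dot(c, b)                         # c_i^T b_i
        if delta == 0:
            return alphas, betas, p
        Ab, Atc = mat_vec(A, b), mat_vec(At, c)
        alpha = dot(c, Ab) / delta
        beta = Fraction(0) if delta_prev is None else delta / delta_prev
        alphas.append(alpha)
        if delta_prev is not None:
            betas.append(beta)
        b_next = [x - alpha * y - beta * z for x, y, z in zip(Ab, b, b_prev)]
        c_next = [x - alpha * y - beta * z for x, y, z in zip(Atc, c, c_prev)]
        # p_{i+1} = (lambda - alpha_i) p_i - beta_{i-1} p_{i-1}
        size = len(p) + 1
        shifted = [Fraction(0)] + p               # lambda * p_i
        p_pad = p + [Fraction(0)]
        q_pad = p_prev + [Fraction(0)] * (size - len(p_prev))
        p_next = [s - alpha * u - beta * w for s, u, w in zip(shifted, p_pad, q_pad)]
        b_prev, c_prev, b, c = b, c, b_next, c_next
        p_prev, p, delta_prev = p, p_next, delta
```

For the symmetric matrix with rows (2, 1, 0), (1, 2, 1), (0, 1, 2) and $b_0 = c_0 = (1, 0, 0)^T$, every $\alpha_i = 2$ and every $\beta_i = 1$, the vector $b_3$ vanishes, and $p_3(\lambda) = \lambda^3 - 6\lambda^2 + 10\lambda - 4$.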

Suppose now that the proper values are all distinct. Then if $W$ is the
matrix of proper vectors,

$$W^{-1}AW = \Lambda, \tag{4.23.11}$$

and $\Lambda$ is diagonal. Also

$$W^{-1}p_i(A)W = p_i(\Lambda). \tag{4.23.12}$$

If only one of $b_m$ and $c_m$ vanishes while the other does not, one can by
a different choice of $b_0$ (if $b_m = 0$) or of $c_0$ (if $c_m = 0$) obtain a longer
sequence. Hence suppose that $b_m = c_m = 0$. Moreover $p_m(\lambda)$ has only
simple zeros:

$$p(\lambda) = p_m(\lambda) = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_m). \tag{4.23.13}$$

Then

$$q_i(\lambda) = p(\lambda)/(\lambda - \lambda_i) \tag{4.23.14}$$

is a polynomial, and there is no nonconstant factor common to all the
$q_i$. Hence by a theorem in §2.06 there exist polynomials $f_i(\lambda)$ such that

$$\sum f_i(\lambda)q_i(\lambda) = 1.$$

Hence

$$\sum f_i(A)q_i(A) = I = \sum f_i(A^T)q_i(A^T), \tag{4.23.15}$$

and therefore

$$\sum f_i(A)q_i(A)b_0 = b_0, \qquad \sum f_i(A^T)q_i(A^T)c_0 = c_0. \tag{4.23.16}$$

Also, since polynomials in $A$ (or those in $A^T$) are commutative with one
another,

$$\sum f_i(A)q_i(A)b_j = b_j, \qquad \sum f_i(A^T)q_i(A^T)c_j = c_j.$$

But

$$(A - \lambda_iI)f_i(A)q_i(A)b_j = f_i(A)p(A)b_j = 0.$$

Hence $f_i(A)q_i(A)b_j$ is a proper vector of $A$ associated with the proper value
$\lambda_i$, so that each $b_j$ is expressed as a linear combination of proper vectors.
Again, if the first of (4.23.16) is solved for any proper vector

$$f_i(A)q_i(A)b_0,$$

it is expressed as a linear combination of vectors $b_0$, $Ab_0$, $A^2b_0$, . . . , and
these in turn are expressible as linear combinations of the $b_j$. Hence
each proper vector appearing in the first of (4.23.16) is expressible as a
linear combination of the $b_j$. Likewise each proper vector appearing
in the second of (4.23.16) is expressible as a linear combination of the
$c_j$. Let $u_i$ represent the proper vectors of $A$, $v_i$ the proper vectors of
$A^T$ which appear in (4.23.16). Then



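$$b_0 = \sum u_i, \qquad c_0 = \sum v_i. \tag{4.23.17}$$

Since the $u_i$ and $v_i$ are proper vectors, therefore

$$b_j = \sum_i p_j(\lambda_i)u_i, \qquad c_j = \sum_i p_j(\lambda_i)v_i. \tag{4.23.18}$$

Hence if we let

$$P = \begin{pmatrix} p_0(\lambda_1) & p_1(\lambda_1) & \cdots & p_{m-1}(\lambda_1) \\ \vdots & \vdots & & \vdots \\ p_0(\lambda_m) & p_1(\lambda_m) & \cdots & p_{m-1}(\lambda_m) \end{pmatrix}, \tag{4.23.19}$$

$$\begin{aligned}
B &= (b_0 \;\cdots\; b_{m-1}), & C &= (c_0 \;\cdots\; c_{m-1}),\\
U &= (u_1 \;\cdots\; u_m), & V &= (v_1 \;\cdots\; v_m),
\end{aligned} \tag{4.23.20}$$

then (4.23.18) can be written

$$B = UP, \qquad C = VP. \tag{4.23.21}$$

But

$$V^TU = D,$$

where $D$ is a diagonal matrix. Hence

$$V^TB = DP, \qquad C^TU = P^TD.$$

But we already know that the $u$'s can be expressed in terms of the $b$'s, and
the $v$'s in terms of the $c$'s:

$$U = BH, \qquad V^T = KC^T.$$

Moreover,

$$C^TB = \Delta$$

is also a diagonal matrix. Hence

$$C^TU = \Delta H, \qquad H = \Delta^{-1}C^TU = \Delta^{-1}P^TD,$$

$$U = B\Delta^{-1}P^TD, \qquad V^T = DP\Delta^{-1}C^T. \tag{4.23.22}$$

The diagonal matrix $D$ is determined only by scale factors in the vectors
$u_i$ and $v_i$ and these can be chosen as convenient. Thus the $m$ proper
vectors of $A$ which appear in (4.23.17) can be expressed as linear
combinations of the $b_i$, and the $m$ proper vectors of $A^T$ (or proper row vectors of $A$)
can be expressed as linear combinations of the $c_i$ (or of the $c_i^T$).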

When $m < n$, in general this will be because $\phi(\lambda)$ has multiple zeros.
Suppose still that $p_m(\lambda)$ has only simple zeros and take a $b'_0$ orthogonal to
all the $c_j$. Then $Ab'_0$ is also orthogonal to all the $c_j$, since

$$c_j^TAb'_0 = (A^Tc_j)^Tb'_0$$

and $A^Tc_j$ is itself a linear combination of the $c$'s. Likewise if $c'_0$ is
orthogonal to all the $b$'s of the original sequence, so also will $A^Tc'_0$ be orthogonal
to all the $b$'s of this sequence. Hence one can develop sequences $b'_i$ and $c'_i$,
and these vectors will be independent of those hitherto found. They
will yield new proper vectors associated with the multiple roots.
Moreover, the new sequences will in general have fewer than $m$ members each.
If they have fewer than $n - m$ members, a third pair of sequences must
be started with $b''_0$ orthogonal to all previous $c$'s, and $c''_0$ orthogonal to all
previous $b$'s. Since each such sequence will contain at least one member,
the initial vector of the sequence, the process will eventually terminate
and yield all proper vectors.

If, instead of imposing the orthogonality requirement upon $b'_0$ and $c'_0$,
one required only that $b'_0 \ne b_0$ and $c'_0 \ne c_0$, then in general new sequences
of $m$ terms each will result, proper vectors associated with simple roots
will be found over again, but a new proper vector for $A$ and one for $A^T$ will
be found associated with each multiple root. This course might be
preferred in order to avoid the computation of the orthogonal starting
vectors.

When the zeros of $p_m(\lambda)$ are not all distinct, it is still possible to obtain
the proper vectors in much the same way. A resolution such as (4.23.16)
can be employed to show that $b_0$ and $c_0$ are expressible as sums of principal
vectors. Here $q_i(\lambda)$ is the quotient of $p(\lambda)$ by the highest power of
$(\lambda - \lambda_i)$ it contains as a factor. Each principal vector which appears in
(4.23.16) is expressible as a linear combination of the $b_j$ (or of the $c_j$).
Also, if $x_i$ is the principal vector of $A$ associated with $\lambda_i$, then $(A - \lambda_iI)x_i$,
$(A - \lambda_iI)^2x_i$, . . . can each be expressed as a linear combination of the $b_j$.
But every vector of this sequence is a principal vector or vanishes, and
one is a proper vector. Hence associated with each zero of $p_m(\lambda)$ there
is a proper vector of $A$ which is a linear combination of the $b$'s and a
proper vector of $A^T$ which is a linear combination of the $c$'s. Let $u_i$ be
the proper vector of $A$ associated with $\lambda_i$. Then

$$u_i = \sum_j \mu_{ij}b_j.$$

Hence

$$c_j^Tu_i = \mu_{ij}c_j^Tb_j.$$

But

$$c_j = p_j(A^T)c_0,$$

whence

$$c_j^Tu_i = c_0^Tp_j(A)u_i = p_j(\lambda_i)c_0^Tu_i, \tag{4.23.23}$$

since $u_i$ is a proper vector. Hence

$$\mu_{ij}c_j^Tb_j = p_j(\lambda_i)c_0^Tu_i \tag{4.23.24}$$

and therefore

$$u_i = c_0^Tu_i\sum_j p_j(\lambda_i)b_j/c_j^Tb_j.$$

But $c_0^Tu_i$ is a scale factor which can be chosen arbitrarily. Thus if in
(4.23.19) the matrix $P$ is taken to be rectangular, one row for each $\lambda_i$, and
the matrices $U$ and $V$ contain only the proper vectors, then (4.23.22)
holds also in this more general case. Again if $m < n$, one can select
vectors $b'_0$ orthogonal to every $c_j$, and $c'_0$ orthogonal to every $b_j$, and form
new sequences to provide any proper vectors not already found.

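4.24. The Triple-diagonal Form for a Symmetric Matrix. In the
method outlined in the last section, consider again for the moment the
case $m = n$. Let

$$B = (b_0 \; b_1 \;\cdots\; b_{n-1}), \qquad C = (c_0 \; c_1 \;\cdots\; c_{n-1}).$$

Then from the defining relations we have $C^TB$ as a diagonal matrix $D$
whose diagonal elements are

$$\delta_0 = c_0^Tb_0 \ne 0, \quad \ldots, \quad \delta_{n-1} = c_{n-1}^Tb_{n-1} \ne 0. \tag{4.24.1}$$

Hence

$$C^TB = D, \qquad C^{-1} = D^{-1}B^T.$$

Also we found that

$$\begin{aligned}
c_i^TAb_i &= \alpha_i\delta_i,\\
c_i^TAb_{i-1} &= c_{i-1}^TAb_i = c_i^Tb_i = \delta_i,\\
c_i^TAb_j &= 0, \qquad |i - j| > 1.
\end{aligned}$$

Hence the product $C^TAB$ is a matrix for which the only non-null elements
lie along, just above or just below, the main diagonal:

$$C^TAB = \begin{pmatrix} \alpha_0\delta_0 & \delta_1 & 0 & \cdots \\ \delta_1 & \alpha_1\delta_1 & \delta_2 & \cdots \\ 0 & \delta_2 & \alpha_2\delta_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}. \tag{4.24.2}$$

Hence

$$B^{-1}AB = D^{-1}C^TAB = \begin{pmatrix} \alpha_0 & \beta_0 & 0 & \cdots \\ 1 & \alpha_1 & \beta_1 & \cdots \\ 0 & 1 & \alpha_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \tag{4.24.3}$$

since by (4.23.6) we know that $\beta_{i-1} = \delta_i/\delta_{i-1}$. One should expect that
after reducing the matrix to this form a considerable step has been taken
toward complete diagonalization.

In case $A$ is symmetric, if $c_0 = b_0$, then $C = B$, and every $\delta_i > 0$.
Hence $D^{1/2}$ is a real diagonal matrix. If we set

$$U = BD^{-1/2},$$

then $U$ is an orthogonal matrix, and

$$S = U^TAU = \begin{pmatrix} \alpha_0 & \sqrt{\beta_0} & 0 & \cdots \\ \sqrt{\beta_0} & \alpha_1 & \sqrt{\beta_1} & \cdots \\ 0 & \sqrt{\beta_1} & \alpha_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}. \tag{4.24.4}$$

For the case of a symmetric matrix, we now consider another method of
obtaining the triple-diagonal form (4.24.4).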
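Before doing so, however, consider the characteristic function. Let

$$\begin{aligned}
p_0(\lambda) &= 1,\\
p_1(\lambda) &= \lambda - \alpha_0,\\
p_2(\lambda) &= \begin{vmatrix} \lambda - \alpha_0 & -\sqrt{\beta_0} \\ -\sqrt{\beta_0} & \lambda - \alpha_1 \end{vmatrix} = (\lambda - \alpha_1)p_1(\lambda) - \beta_0p_0(\lambda),\\
p_3(\lambda) &= \begin{vmatrix} \lambda - \alpha_0 & -\sqrt{\beta_0} & 0 \\ -\sqrt{\beta_0} & \lambda - \alpha_1 & -\sqrt{\beta_1} \\ 0 & -\sqrt{\beta_1} & \lambda - \alpha_2 \end{vmatrix} = (\lambda - \alpha_2)p_2(\lambda) - \beta_1p_1(\lambda),\\
&\;\;\vdots
\end{aligned} \tag{4.24.5}$$

Thus the polynomials $p_i(\lambda)$ are the expansions of the determinants of the
first principal minors of the matrix $\lambda I - S$. Note that $p_{i+1}(\lambda)$ and $p_i(\lambda)$
cannot have a common factor, for if they did, this factor would be
contained also in $p_{i-1}(\lambda)$, hence also in $p_{i-2}(\lambda)$, . . . , hence also in $p_0(\lambda)$,
which is absurd. Also at any $\rho$ for which $p_i(\rho) = 0$, $p_{i+1}(\rho)$ and $p_{i-1}(\rho)$
have opposite signs, since each $\beta_i > 0$.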

We can show that between any consecutive zeros of $p_i(\lambda)$ there is a
zero of $p_{i+1}(\lambda)$, and further that $p_{i+1}(\lambda)$ has a zero to the left, and one
to the right, of all those of $p_i(\lambda)$. Hence between any two consecutive
zeros of $p_{i+1}(\lambda)$ there is exactly one zero of $p_i(\lambda)$. Since $p_1(\alpha_0) = 0$, and
since $p_2(\alpha_0) < 0$, while $p_2(\pm\infty) = +\infty$, the statement is certainly true
for $i = 1$. Suppose it demonstrated for $p_2$, $p_3$, . . . , $p_{i+1}$ and consider
$p_{i+2}$. Let $\rho_1$ and $\rho_2$ be consecutive zeros of $p_{i+1}(\lambda)$. Then

$$p_{i+2}(\rho_1) = -\beta_ip_i(\rho_1), \qquad p_{i+2}(\rho_2) = -\beta_ip_i(\rho_2).$$

The hypothesis implies that $p_2$, $p_3$, . . . , $p_{i+1}$ can have only simple zeros
and that $p_i$ has one and only one zero between the consecutive zeros $\rho_1$ and
$\rho_2$ of $p_{i+1}$; hence $p_i(\rho_1)$ and $p_i(\rho_2)$ have opposite signs, and hence $p_{i+2}(\rho_1)$
and $p_{i+2}(\rho_2)$ have opposite signs. Therefore $p_{i+2}$ has an odd number of
zeros between $\rho_1$ and $\rho_2$.

Next suppose $\rho$ is the greatest zero of $p_{i+1}$. Then $p_{i+2}(\rho) = -\beta_ip_i(\rho)$.
But $p_i$ has no zero exceeding $\rho$, and $p_i(\infty) = +\infty$. Hence $p_i(\rho) > 0$,
and therefore $p_{i+2}(\rho) < 0$. Hence $p_{i+2}(\lambda)$ has an odd number of zeros
exceeding $\rho$. But $p_{i+2}$ is of degree $i + 2$. The hypothesis of the
induction implies that $p_{i+1}$ has $i + 1$ real and distinct zeros; these divide the
real $\lambda$ axis into $i$ segments and two rays extending to $+\infty$ and to $-\infty$,
respectively. We have shown that each segment and one of the rays
each has on it an odd number of zeros of $p_{i+2}$. Hence each can contain
only one, and the remaining zero lies on the other ray.

Thus the polynomials $p_i(\lambda)$ have all the properties required of a Sturm
sequence, as in §3.05, though they are not formed in the same way.
Hence by counting the number of variations in sign exhibited by the
sequence $p_i(\lambda)$ at each of two values of $\lambda$ and by taking the difference, one
has the exact number of proper values of the matrix $A$ contained on the
interval between these values.
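The counting rule can be sketched in Python. The sketch takes the $\alpha_i$ and $\beta_i$ of the triple-diagonal form as given, evaluates the sequence $p_0(\lambda)$, . . . , $p_n(\lambda)$ by the recursion, and counts variations in sign, a vanishing term being passed over (a common convention, assumed here). With the $p_i$ formed as above, the number of variations at $\lambda$ is the number of proper values exceeding $\lambda$. The function names are illustrative only.

```python
def sign_variations(alphas, betas, lam):
    """Variations in sign of p_0(lam), ..., p_n(lam), where
    p_{i+1} = (lam - alpha_i) p_i - beta_{i-1} p_{i-1}."""
    p_prev, p = 0.0, 1.0
    values = [p]
    for i, alpha in enumerate(alphas):
        beta = betas[i - 1] if i > 0 else 0.0
        p_prev, p = p, (lam - alpha) * p - beta * p_prev
        values.append(p)
    nonzero = [v for v in values if v != 0.0]   # vanishing terms are skipped
    return sum(1 for u, v in zip(nonzero, nonzero[1:]) if u * v < 0.0)


def count_proper_values(alphas, betas, lo, hi):
    """Number of proper values greater than lo and not exceeding hi."""
    return sign_variations(alphas, betas, lo) - sign_variations(alphas, betas, hi)
```

With $\alpha_0 = \alpha_1 = \alpha_2 = 2$ and $\beta_0 = \beta_1 = 1$ the proper values are $2 - \sqrt{2}$, 2, and $2 + \sqrt{2}$; the sketch finds three of them between 0 and 4, and one between 1 and 3.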

Now return to the symmetric matrix $A$ and consider, ab initio, the
problem of reducing it to a triple-diagonal form. Suppose $A$ is $3 \times 3$.
Then if $\alpha_{12} \ne 0$, one can find an orthogonal matrix of the form

$$U = \begin{pmatrix} 1 & 0 & 0 \\ 0 & c & -s \\ 0 & s & c \end{pmatrix}, \tag{4.24.6}$$

where $c$ and $s$ are the cosine and sine of some angle, such that in the
transformed matrix

$$A' = U^TAU$$

the elements $\alpha'_{13} = \alpha'_{31} = 0$. In fact, one finds

$$\alpha'_{13} = \alpha'_{31} = c\alpha_{13} - s\alpha_{12}, \tag{4.24.7}$$

and hence one has only to choose

$$\kappa = \alpha_{13}/\alpha_{12}, \qquad c = (1 + \kappa^2)^{-1/2}, \qquad s = c\kappa, \tag{4.24.8}$$

which can always be done if $\alpha_{12} \ne 0$. It turns out then that



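$$\alpha'_{12} = \alpha'_{21} = c\alpha_{12} + s\alpha_{13} = \alpha_{12}(1 + \kappa^2)^{1/2},$$

so that as a result of the transformation $\alpha'_{12} \ne 0$.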

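For an arbitrary symmetric matrix $A$, let it be partitioned

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \tag{4.24.9}$$

where $A_{11}$ is of order 3. Suppose $\alpha_{12} \ne 0$. Let

$$V = \begin{pmatrix} U & 0 \\ 0 & I \end{pmatrix}, \tag{4.24.10}$$

where $U$ is the orthogonal matrix of order 3 defined by (4.24.6) and
(4.24.8). Then $V$ is orthogonal, and

$$V^TAV = \begin{pmatrix} U^TA_{11}U & U^TA_{12} \\ A_{21}U & A_{22} \end{pmatrix}. \tag{4.24.11}$$

Hence the matrix $V$ transforms the symmetric matrix $A$, arbitrary except
that $\alpha_{12} \ne 0$, into a matrix $A'$ in which $\alpha'_{13} = 0$. Moreover, the
transformation leaves unaffected all elements in the first row of $A_{12}$, and in the
first column of $A_{21}$, as well as the entire submatrix $A_{22}$.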

Now if in $A'$ the element $\alpha'_{14} \ne 0$, one can interchange the third and
fourth columns and the third and fourth rows, apply a similar
transformation, interchange again, if desired, and obtain a matrix $A''$ in which

$$\alpha''_{13} = \alpha''_{31} = \alpha''_{14} = \alpha''_{41} = 0.$$

By continuing in this fashion, all elements but the first two in the first
row and all elements but the first two in the first column can be caused to
vanish.

Having achieved this, one can now operate with the submatrix of order
$n - 1$ obtained after leaving out the first row and the first column of the
matrix. Eventually, therefore, one obtains the required triple-diagonal
form. When the characteristic equation is solved, the proper vectors
associated with each proper value are obtained by a direct solution of the
homogeneous equations.

If it should happen that any $\beta_i = 0$ on the off diagonals, then it is
sufficient to consider separately the characteristic equation and proper
vectors of the submatrix above and to the left of the vanishing $\beta_i$ and of
that below and to the right.
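The reduction by plane rotations can be sketched in Python as follows. This is an illustration rather than the scheme exactly as stated: instead of interchanging rows and columns, each rotation is applied directly in the plane of the indices $k + 1$ and $j$, and $c$ and $s$ are formed as $\alpha_{k,k+1}/r$ and $\alpha_{kj}/r$ with $r = (\alpha_{k,k+1}^2 + \alpha_{kj}^2)^{1/2}$, which agrees with (4.24.8) up to sign when $\alpha_{k,k+1} \ne 0$ and remains usable when it vanishes. The function name is introduced here.

```python
import math


def rotate_to_triple_diagonal(A):
    """Return a triple-diagonal matrix orthogonally similar to the
    symmetric matrix A (given as a list of rows)."""
    n = len(A)
    A = [[float(x) for x in row] for row in A]
    for k in range(n - 2):                  # clear row and column k
        for j in range(k + 2, n):
            if A[k][j] == 0.0:
                continue
            p, q = k + 1, j                 # plane of the rotation
            r = math.hypot(A[k][p], A[k][q])
            c, s = A[k][p] / r, A[k][q] / r
            for i in range(n):              # columns: A <- A U
                aip, aiq = A[i][p], A[i][q]
                A[i][p] = c * aip + s * aiq
                A[i][q] = -s * aip + c * aiq
            for i in range(n):              # rows: A <- U^T A
                api, aqi = A[p][i], A[q][i]
                A[p][i] = c * api + s * aqi
                A[q][i] = -s * api + c * aqi
    return A
```

Since each step is an orthogonal similarity, the trace and the sum of squares of all elements are unchanged, which affords a simple check on the computation.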

4.3. Bibliographic Notes. A good development of the topics outlined
in §4.0 can be found in MacDuffee (1943). Of the many papers on bounds
see especially Taussky (1949), Brauer (1946, 1947, 1948), Parker (1951), 
Ostrowski (1952), and Price (1951). 

Papers by Aitken (1931, 1936-1937b) are classic. A more recent
general discussion of iterative methods is by Semendiaev (1950).
Bargmann, Montgomery, and von Neumann (1946) discuss the use of the
trace of powers of A and consider in particular the prevention of
"overflow" in spite of round-off. The use of polynomials to accelerate
convergence was proposed by Flanders and Shortley (1950). On the
use of transformations and orthogonalization for successive proper values
see Hotelling (1943) and Feller and Forsythe (1951). Kohn (1949)
describes without proof an iteration which converges to an arbitrary
(but random) proper value.

Aitken uses the A operator for obtaining successive proper values. 
On equal and nearly equal roots see Rosser, Lanczos, Hestenes, and Karush 
(1951). Rotational diagonalization, for which equality or near equality 
is no difficulty, was used by Kelley (1935), but is in fact much older. 
The method was discussed by Goldstine in August, 1951, at a symposium 
at the Institute for Numerical Analysis and is to be published with 
detailed error analysis in a paper by Goldstine, Murray, and von Neumann. 

The method of §4.21 is due to Frame (1949, and an unpublished paper).
It was also published by Fettis (1950), but without detail or
consideration of multiple roots. The recursion defined by (4.22.5) was given by
Bryan (1950). On the escalator method, see Morris and Head (1942),
and for a more general (brief) treatment, Vinograde (1951). Lanczos
(1951a and 1951b) gave the method of minimized iteration. The
triple-diagonal form with a valuable treatment of error is discussed by Givens
(1951 and a forthcoming memorandum).

That the polynomials $p_i(\lambda)$ form a Sturm sequence is a classical result
(see Browne, 1930).

A method for the simultaneous improvement of approximation to all 
proper values is given by Jahn (1948) and Collar (1948). 



CHAPTER 5 
INTERPOLATION 



5. Interpolation 

This book falls naturally into two parts: one part dealing with the 
solution of equations and systems of equations, the other part dealing 
with the approximate representation of functions. We come now to the 
second part. 

It may be that a function or its integral or its derivative is not easily 
evaluated; or that one knows nothing but a limited number of its func- 
tional values, and perhaps these only approximately. In either case one 
may require an approximate representation in some form that is readily 
evaluated or integrated or otherwise manipulated. If one knows only 
certain functional values, which, however, are presumed exact, the 
approximate representation may be required to assume the same func- 
tional values corresponding to the given values of the argument. The 
problem is then one of interpolation. If the given functional values 
cannot be taken as exact, then a somewhat simpler representation will be 
accepted, and one which is not required to take the same, but only 
approximately the same, functional values. This is smoothing or curve 
fitting. Even a function that is easy to evaluate may not be easy to 
integrate. For approximate quadrature, therefore, the usual method 
is to obtain an approximate representation in terms of functions that 
are easily integrated. 

In general, the method is to select from some class of simple functions
$\phi(x)$ a limited number, $\phi_0(x)$, $\phi_1(x)$, . . . , $\phi_n(x)$, and attempt to
approximate the required function $f(x)$ by a linear combination of these functions.
Thus we wish to find constants $\gamma_i$ such that the function

$$\Phi(x) = \sum \gamma_i\phi_i(x) \tag{5.0.1}$$

is in some sense a reasonable approximation to $f(x)$. If we agree to use
$n + 1$ functions $\phi_i$, then we have $n + 1$ constants $\gamma_i$ at our disposal, and
we can impose $n + 1$ conditions for their determination. In
interpolation the conditions imposed are that $f(x)$ and $\Phi(x)$ shall be equal at each
of $n + 1$ distinct values $x_i$ of the abscissas. Thus we require that

$$f(x_i) = \Phi(x_i) \qquad (i = 0, 1, \ldots, n). \tag{5.0.2}$$

The points $x_i$ on the axis of abscissas will be called the fundamental
points. A particularly simple choice of functions $\phi_i(x)$ for most
computational purposes is

$$\phi_i(x) = x^i,$$

so that $\Phi(x)$ is a polynomial of degree $n$.

Another possibility is to require that (5.0.2) shall hold for $i = 0, \ldots, r$,
while for $j = r + 1, \ldots, n$ we require that

$$f'(x_j) = \Phi'(x_j). \tag{5.0.3}$$

In this event all the $x_j$ must be distinct, but they may coincide with some
of the $x_i$. We could equally well require that higher derivatives of $f$ and
$\Phi$ shall be equal for certain values of $x$, provided that altogether we have
exactly $n + 1$ independent and consistent conditions imposed on the $\gamma$'s.
Still other types of conditions may be used, and some will be discussed
later.

To return to (5.0.2), if we write

$$y_i = f(x_i) \tag{5.0.4}$$

for brevity, then Eqs. (5.0.2) can be written

$$y_i = \sum_j \gamma_j\phi_j(x_i). \tag{5.0.5}$$

If the determinant

$$\Delta = |\phi_j(x_i)| \ne 0, \tag{5.0.6}$$

then these equations have a unique solution which can be written down
on applying Cramer's rule. It is clear that (5.0.6) will not be satisfied
if any two of the $x_i$ are the same, since then two rows of the determinant
would be identical.

Equations (5.0.1) and (5.0.5) can be regarded as $n + 2$ homogeneous
equations in the $n + 2$ quantities $\gamma_0$, $\gamma_1$, . . . , $\gamma_n$, $-1$. Hence their
determinant must vanish:

$$\begin{vmatrix}
\phi_0(x_0) & \phi_1(x_0) & \cdots & \phi_n(x_0) & y_0 \\
\vdots & \vdots & & \vdots & \vdots \\
\phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_n(x_n) & y_n \\
\phi_0(x) & \phi_1(x) & \cdots & \phi_n(x) & \Phi(x)
\end{vmatrix} = 0. \tag{5.0.7}$$

This can be regarded as an equation in $\Phi(x)$. If we expand this
determinant by elements of the last row, the last term will be $\Delta\Phi(x)$, and every
other term will be equal to some $\phi_j(x)$ multiplied by its cofactor, which is
a constant. Hence when we solve for $\Phi(x)$, we shall have $\Phi(x)$ expressed
as a linear combination of the functions $\phi_j(x)$ in just the form (5.0.1), as
required. Also if we set $x = x_i$ in (5.0.7) and subtract row $i + 1$ from
the last row, we get

$$\Delta[\Phi(x_i) - y_i] = 0,$$

which shows that $\Phi(x_i) = y_i$, and the values of $\Phi$ at the points $x_i$ agree
with those of $f(x)$. We can write the solution of (5.0.7) in the form


$$\Delta\Phi(x) = -\begin{vmatrix}
\phi_0(x_0) & \phi_1(x_0) & \cdots & \phi_n(x_0) & y_0 \\
\vdots & \vdots & & \vdots & \vdots \\
\phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_n(x_n) & y_n \\
\phi_0(x) & \phi_1(x) & \cdots & \phi_n(x) & 0
\end{vmatrix}. \tag{5.0.8}$$

If we expand the determinant on the right of (5.0.8) by elements of the
last row, and divide by $\Delta$, we obtain the form (5.0.1). Also we note that,
if we expand along the last column and divide by $\Delta$, we obtain a form

$$\Phi(x) = \sum y_i\Lambda_i(x), \tag{5.0.9}$$

where each $\Lambda_i(x)$ is itself a particular linear combination of the $\phi_j(x)$ with
coefficients depending only upon the $x_k$. Hence for a particular set of
fundamental points the $\Lambda_i$ can be calculated once and used for any $f(x)$.
This would be useful, for example, if interpolations are to be made for
each of several different functions, all of which are tabulated for the same
values $x_k$.
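The two forms (5.0.1) and (5.0.9) can be illustrated by a short Python sketch. It solves Eqs. (5.0.5) by elimination rather than by Cramer's rule (an equivalent but more convenient route, assumed here), and it obtains each $\Lambda_i$ as the interpolating function for the data which are 1 at $x_i$ and 0 at the other fundamental points. The function names are introduced for illustration; exact rational arithmetic is used.

```python
from fractions import Fraction


def solve(M, rhs):
    """Gauss-Jordan elimination in exact arithmetic; fails if the
    determinant (5.0.6) vanishes."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def interpolation_coefficients(basis, xs, ys):
    """Solve (5.0.5) for the constants gamma_j."""
    M = [[phi(x) for phi in basis] for x in xs]
    return solve(M, ys)


def evaluate(basis, gammas, x):
    """Value of the linear combination (5.0.1) at x."""
    return sum(g * phi(x) for g, phi in zip(gammas, basis))


def cardinal_coefficients(basis, xs):
    """Coefficient lists of the Lambda_i, one list for each point x_i."""
    n = len(xs)
    return [interpolation_coefficients(basis, xs, [1 if k == i else 0 for k in range(n)])
            for i in range(n)]
```

With the functions 1, $x$, $x^2$, the fundamental points 0, 1, 2, and the values 1, 2, 5 of $f(x) = x^2 + 1$, both forms give the value 10 at $x = 3$; the $\Lambda_i$ there take the values 1, $-3$, 3.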

Equation (5.0.9) exhibits an important property of interpolating
functions: their linearity. Thus, to make explicit the fact that $\Phi$ is the
interpolating function for $f(x)$, let us designate it $\Phi(f; x)$ and write
(5.0.9) in the form

$$\Phi(f; x) = \sum f(x_i)\Lambda_i(x). \tag{5.0.10}$$

By the same rule, if $g$ is any other function, its interpolating function is

$$\Phi(g; x) = \sum g(x_i)\Lambda_i(x).$$

But then if $\lambda$ and $\mu$ are any constants, and $h(x) = \lambda f(x) + \mu g(x)$, it
follows that

$$\Phi(h; x) = \sum[\lambda f(x_i) + \mu g(x_i)]\Lambda_i(x) \equiv \lambda\Phi(f; x) + \mu\Phi(g; x). \tag{5.0.11}$$

It is understood that the basic functions $\phi_i$ and the fundamental points
are fixed throughout.

We may note further that, since by (5.0.7) $\Phi(\phi_j; x) = \phi_j$, therefore

$$\phi_j(x) \equiv \sum_{i=0}^{n} \phi_j(x_i)\Lambda_i(x) \qquad (j = 0, 1, \ldots, n). \tag{5.0.12}$$

Suppose we require now that at $x_n$ the derivatives of $\Phi$ and $f$ shall be
equal:

$$\Phi'(x_n) = f'(x_n) = y'_n.$$

Then $x_n$ may or may not coincide with one of the other $x_i$. We modify
(5.0.7) in the next to the last line and write

$$\begin{vmatrix}
\phi_0(x_0) & \phi_1(x_0) & \cdots & \phi_n(x_0) & y_0 \\
\vdots & \vdots & & \vdots & \vdots \\
\phi_0(x_{n-1}) & \phi_1(x_{n-1}) & \cdots & \phi_n(x_{n-1}) & y_{n-1} \\
\phi'_0(x_n) & \phi'_1(x_n) & \cdots & \phi'_n(x_n) & y'_n \\
\phi_0(x) & \phi_1(x) & \cdots & \phi_n(x) & \Phi(x)
\end{vmatrix} = 0. \tag{5.0.13}$$

Again the function $\Phi(x)$ which satisfies this equation is a linear
combination of the functions $\phi_i(x)$ which takes on the prescribed values $y_j$ for
$j = 0, 1, \ldots, n - 1$. Since (5.0.13) is an identity, we can differentiate,
and only the last line is affected, all other elements being constant.
Hence we see that $\Phi'$ takes on the prescribed value at $x_n$. If we require
the derivatives of $\Phi$ and $f$ to be equal for any other $x_i$, we replace also
that row of (5.0.13) by a row of derivatives, and again the equation
defines the required function $\Phi$. This procedure can be used for
derivatives of any order where the solution (5.0.8) is modified only by replacing
appropriate rows in the two determinants by rows of derivatives of the
order required. The form (5.0.1) still holds, and the form (5.0.9) is
modified only by the replacement of certain $y_i$ by the value of the
derivative. The linearity property is unaffected.

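5.01. Some Expressions for the Remainder. The determinant

$$W(x) = \begin{vmatrix}
\phi_0(x) & \phi_1(x) & \cdots & \phi_n(x) \\
\phi'_0(x) & \phi'_1(x) & \cdots & \phi'_n(x) \\
\vdots & \vdots & & \vdots \\
\phi_0^{(n)}(x) & \phi_1^{(n)}(x) & \cdots & \phi_n^{(n)}(x)
\end{vmatrix} \tag{5.01.1}$$

is known as the Wronskian. If this remains different from zero
everywhere on the interval of interpolation $(a, b)$, one can define the linear
operator $L_{n+1}$ by the relation

$$L_{n+1}[\phi] = W^{-1}(x)\begin{vmatrix}
\phi_0(x) & \cdots & \phi_n(x) & \phi(x) \\
\phi'_0(x) & \cdots & \phi'_n(x) & \phi'(x) \\
\vdots & & \vdots & \vdots \\
\phi_0^{(n+1)}(x) & \cdots & \phi_n^{(n+1)}(x) & \phi^{(n+1)}(x)
\end{vmatrix}, \tag{5.01.2}$$

and the linear differential equation of order $n + 1$

$$L_{n+1}[\phi] = 0 \tag{5.01.3}$$

is satisfied by each $\phi_i(x)$. Moreover every solution of (5.01.3) is
expressible as a linear combination of the $\phi_i$.

In general, for $\nu \le n$ one can similarly define the linear operator $L_{\nu+1}$,
and the differential equation of order $\nu + 1$

$$L_{\nu+1}[\phi] = 0 \tag{5.01.4}$$

is satisfied by $\phi_0$, $\phi_1$, . . . , $\phi_\nu$, and every solution of (5.01.4) is expressible
as a linear combination with constant coefficients of these $\phi$'s.

An equivalent definition of the operators $L_{\nu+1}$ can be obtained as
follows: Define the differential operator

$$D = d/dx, \tag{5.01.5}$$

and select $b_0$ so that

$$(D - b_0)\phi_0(x) \equiv 0, \qquad b_0(x) = \phi'_0(x)/\phi_0(x). \tag{5.01.6}$$

Then

$$L_1[\phi] \equiv (D - b_0)\phi, \tag{5.01.7}$$

since the two differential equations $L_1[\phi] = 0$ and $(D - b_0)\phi = 0$ are
both satisfied by $\phi_0$. Again let $b_1$ satisfy

$$(D - b_1)(D - b_0)\phi_1 \equiv 0, \qquad b_1 = (\phi'_1 - b_0\phi_1)'/(\phi'_1 - b_0\phi_1). \tag{5.01.8}$$

Hence

$$L_2[\phi] \equiv (D - b_1)(D - b_0)\phi = (D - b_1)L_1[\phi(x)]. \tag{5.01.9}$$

Proceeding sequentially, we define $b_2$, $b_3$, . . . , $b_n$ so that

$$L_{n+1}[\phi] = (D - b_n) \cdots (D - b_0)\phi = (D - b_n)L_n[\phi]. \tag{5.01.10}$$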

A generalization of Rolle's theorem can now be stated: If the functions
$b_i(x)$ are all analytic on the interval $(a, b)$, and if $\phi(x)$ is analytic, and
vanishes $n + 2$ times, counting multiplicities, then $L_{n+1}[\phi]$ vanishes at
least once on the interval.

First consider any two consecutive zeros of $\phi$, and define

$$\psi(x) = \phi(x)\exp\left[-\int_a^x b_0(x)\,dx\right].$$

Then

$$\psi'(x) = L_1[\phi]\exp\left[-\int_a^x b_0(x)\,dx\right].$$

Then $\psi$ and $\phi$ vanish together, as do $\psi'$ and $L_1[\phi]$. By Rolle's theorem
$\psi'$ vanishes at least once between consecutive zeros of $\phi$. By a simple
extension of the argument, $L_2[\phi]$ vanishes at least once between
consecutive zeros of $L_1[\phi]$. Eventually we conclude that $L_{n+1}[\phi]$ vanishes at
least once between consecutive zeros of $L_n[\phi]$, and hence at least once
on the interval.

Define the function

$$g(x, s) = W^{-1}(s)\begin{vmatrix}
\phi_0(s) & \phi_1(s) & \cdots & \phi_n(s) \\
\phi'_0(s) & \phi'_1(s) & \cdots & \phi'_n(s) \\
\vdots & \vdots & & \vdots \\
\phi_0^{(n-1)}(s) & \phi_1^{(n-1)}(s) & \cdots & \phi_n^{(n-1)}(s) \\
\phi_0(x) & \phi_1(x) & \cdots & \phi_n(x)
\end{vmatrix}. \tag{5.01.11}$$

Then as a function of $x$, $g(x, s)$ satisfies (5.01.3), and

$$\left.\partial^ig(x, s)/\partial x^i\right|_{x=s} = \begin{cases} 0, & i = 0, 1, \ldots, n - 1, \\ 1, & i = n. \end{cases} \tag{5.01.12}$$

Hence one can verify directly that

$$y(x) = \int_a^x g(x, s)f(s)\,ds + \sum a_i\phi_i(x) \tag{5.01.13}$$

satisfies the nonhomogeneous equation

$$L_{n+1}[y] = f \tag{5.01.14}$$

for any constants $a_i$.

Any solutions $\psi_i(x)$ of (5.01.3) with nonvanishing Wronskian could
replace the $\phi_i$ in (5.01.11), and the same $g(x, s)$ would result. This can
be verified directly by writing

$\psi_i(x) = \sum_j c_{ij}\phi_j(x)$

and substituting into (5.01.11), in which case the determinant $|c_{ij}|$
appears as a factor which cancels out. Otherwise one can observe that
the initial conditions (5.01.12) define the solution $g(x, s)$ of (5.01.3)
uniquely. In particular the $\Lambda_i(x)$ are linear combinations of the $\phi_i$, and
hence satisfy (5.01.3), together with the conditions

(5.01.15)   $\Lambda_i(x_j) = \delta_{ij}$.

If the $\Lambda_i$ replace the $\phi_i$ in (5.01.11), one can write

(5.01.16)   $g(x, s) = \sum_i \Lambda_i(x)\Gamma_i(s)$.

Note that

(5.01.17)   $g(x_j, s) = \Gamma_j(s)$,

whence one can write

(5.01.18)   $g(x, s) = \sum_i \Lambda_i(x)g(x_i, s)$.

From (5.01.12) and (5.01.16) we verify that

(5.01.19)   $y(x) = \sum_i \Lambda_i(x)\left[y_i + \int_{x_i}^x \Gamma_i(s)f(s)\,ds\right]$

satisfies (5.01.14) with $y(x_i) = y_i$. In particular

(5.01.20)   $h(x) = \sum_i \Lambda_i(x)\int_{x_i}^x \Gamma_i(s)\,ds$

satisfies

(5.01.21)   $L_{n+1}[h] = 1, \qquad h(x_i) = 0$.

With $h(x)$ defined by (5.01.20) or (5.01.21), we can obtain Petersson's
form for the remainder

(5.01.22)   $R(x) = f(x) - \Phi(x) = f(x) - \sum_i y_i\Lambda_i(x)$,

i.e., the error made in representing $f(x)$ by $\Phi(x)$. The function $R(x)$
vanishes at the $n + 1$ points $x_i$. For any $x' \ne x_i$, we can choose $C$ so that

$f(x') - \Phi(x') - Ch(x') = 0$.

It is clear that $h(x') \ne 0$, since otherwise $h(x)$, which vanishes at every
$x_i$, would have at least $n + 2$ zeros, whence $L_{n+1}[h]$ would vanish at
least once, contrary to (5.01.21). When $C$ is chosen so that

$f(x) - \Phi(x) - Ch(x)$

vanishes at $x'$, that function has $n + 2$ zeros, whence

$L_{n+1}[f(x) - \Phi(x) - Ch(x)] = L_{n+1}[f(x)] - C$

vanishes at least once. Hence for some $\xi$ on $(a, b)$, $C = L_{n+1}[f(\xi)]$.
Hence

$R(x') = L_{n+1}[f(\xi)]\,h(x')$.

Hence if we drop the prime, we have

(5.01.23)   $f(x) = \sum_i y_i\Lambda_i(x) + L_{n+1}[f(\xi)]\,h(x)$,

where $\xi$ is some point on the interval, and $h(x)$ satisfies (5.01.20) and
(5.01.21).
Since certainly $f(x)$ satisfies

(5.01.24)   $L_{n+1}[y] = L_{n+1}[f]$,

we can apply (5.01.19) and assert that

(5.01.25)   $f(x) = \sum_i \Lambda_i(x)\left[y_i + \int_{x_i}^x \Gamma_i(s)L_{n+1}[f(s)]\,ds\right]$.

This can also be written

(5.01.26)   $f(x) = \Phi(x) + \sum_i \Lambda_i(x)\int_{x_i}^x g(x_i, s)L_{n+1}[f(s)]\,ds$

because of (5.01.17).

Since $\int_{x_i}^x = \int_a^x - \int_a^{x_i}$, the remainder $R$ can be written

(5.01.27)   $R(x) = \int_a^x g(x, s)L_{n+1}[f(s)]\,ds - \sum_i \Lambda_i(x)\int_a^{x_i} \Gamma_i(s)L_{n+1}[f(s)]\,ds$

after applying (5.01.16) to the first integral. The formula remains
equally valid, however, when the end point $b$ of the interval of interpolation
replaces the end point $a$ in the limits of integration. When this
replacement is made and the two equivalent expressions for $R(x)$ are
combined, one obtains the more symmetric form,

(5.01.28)   $R(x) = \int_a^b K(x, s)L_{n+1}[f(s)]\,ds$,
            $2K(x, s) = g(x, s)\,\mathrm{sgn}(x - s) - \sum_i \Lambda_i(x)\Gamma_i(s)\,\mathrm{sgn}(x_i - s)$,

where sgn $u$ is the signum function whose value is $+1$ when the argument
is positive and $-1$ when the argument is negative. Although this function
is discontinuous where the argument vanishes, the
kernel $K(x, s)$ remains continuous, since $g(x, s)$ vanishes at $s = x$, and
$\Gamma_i(s) = g(x_i, s)$ vanishes at $s = x_i$.

It is possible to generalize this development to cases where certain conditions
$y_i = \Phi(x_i)$ are replaced by conditions of the form $f^{(\mu)}(x_i) = \Phi^{(\mu)}(x_i)$,
which require the equality of derivatives of $f$ and the approximating
function $\Phi$, rather than equality of their functional values. As one
special case, consider the requirements that at some point $\alpha$

$f^{(\nu)}(\alpha) = \Phi^{(\nu)}(\alpha), \qquad \nu = 0, 1, \ldots, n$.

This gives the Taylor expansion when the functions are polynomials.
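Let the functions $\psi_\nu(x)$ be chosen so that

$L_{\nu+1}[\psi_\nu] \equiv 0, \qquad \psi_\nu(\alpha) = \psi_\nu'(\alpha) = \cdots = \psi_\nu^{(\nu-1)}(\alpha) = 0, \qquad \psi_\nu^{(\nu)}(\alpha) = 1$.

In (5.01.13) replace $a$ by $\alpha$, the $\phi_i$ by the equivalent set $\psi_i$, and let
$L_{n+1}[f]$ take the place of $f$.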



Hence $f(x)$ can be represented in the form

$f(x) = \beta_0\psi_0(x) + \beta_1\psi_1(x) + \cdots + \beta_n\psi_n(x) + \int_\alpha^x g(x, s)L_{n+1}[f(s)]\,ds$,

provided the $\beta$'s are properly selected. On setting $x = \alpha$, we find

$\beta_0 = f(\alpha)$.

Next, apply the operator $L_1$ and again set $x = \alpha$ to obtain

$\beta_1 = L_1[f(\alpha)]$.

Proceeding thus we finally arrive at Petersson's generalized Taylor's
expansion

(5.01.29)   $f(x) = f(\alpha)\psi_0(x) + L_1[f(\alpha)]\psi_1(x) + \cdots + L_n[f(\alpha)]\psi_n(x) + \int_\alpha^x g(x, s)L_{n+1}[f(s)]\,ds$.


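As an illustrative special case (a sketch added for orientation, assuming $\phi_i(x) = x^i$, so that every $b_i$ vanishes and $L_\nu = D^\nu$), the functions entering the generalized Taylor expansion (5.01.29) and the kernel $g$ can be written down explicitly, and the expansion reduces to the ordinary Taylor formula with the integral form of the remainder:

```latex
\psi_\nu(x) = \frac{(x-\alpha)^\nu}{\nu!}, \qquad
g(x,s) = \frac{(x-s)^n}{n!}, \qquad
f(x) = \sum_{\nu=0}^{n} f^{(\nu)}(\alpha)\,\frac{(x-\alpha)^\nu}{\nu!}
       + \int_\alpha^x \frac{(x-s)^n}{n!}\, f^{(n+1)}(s)\,ds .
```

Here $g(x, s)$ satisfies $\partial^{n+1}g/\partial x^{n+1} = 0$, its first $n - 1$ derivatives with respect to $x$ vanish at $x = s$, and its $n$th derivative there is 1, in agreement with (5.01.12).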

5.1. Polynomial Interpolation. Consider now the case of polynomial
interpolation. Equation (5.0.1) becomes

(5.1.1)    $P(x) = c_0 + c_1x + c_2x^2 + \cdots + c_nx^n$.

The determinant $\Delta$ is the Vandermonde determinant,

(5.1.2)    $\Delta = \begin{vmatrix} 1 & x_0 & \cdots & x_0^n \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^n \end{vmatrix} = \prod_{i>j}(x_i - x_j)$,

which vanishes if and only if any two of the $x_i$ coincide. Equation (5.0.8)
takes the form



(5.1.3)    $P(x) \equiv -\Delta^{-1} \begin{vmatrix} 1 & x_0 & \cdots & x_0^n & y_0 \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_n & \cdots & x_n^n & y_n \\ 1 & x & \cdots & x^n & 0 \end{vmatrix}$,

which coincides with (5.1.1) if we expand along the last row, but has the
form

(5.1.4)    $P(x) = \sum_i y_iL_i(x)$

when we expand along the last column. The $L_i$ are themselves polynomials
with coefficients which depend only upon the $x_j$. These polynomials
are

(5.1.5)    $L_i(x) = \prod_{j \ne i}(x - x_j)/(x_i - x_j)$.

They can be obtained by direct expansion of the determinant, or we can
verify that they satisfy the necessary conditions if we note that

(5.1.6)    $L_i(x_j) = \delta_{ij}$

with $\delta_{ij}$ the Kronecker $\delta$. From this it follows that with $L_i(x)$ defined by
(5.1.5) and $P(x)$ by (5.1.4) we have $P(x_i) = y_i$.
We can write $L_i(x)$ in another form if we define

(5.1.7)    $\omega(x) = \prod_i (x - x_i)$,

for then

(5.1.8)    $\omega'(x) = \sum_i \prod_{j \ne i}(x - x_j)$,
(5.1.9)    $\omega'(x_i) = \prod_{j \ne i}(x_i - x_j)$.

Hence

(5.1.10)   $L_i(x) \equiv \omega(x)/[(x - x_i)\omega'(x_i)]$,

and therefore

(5.1.11)   $P(x) \equiv \omega(x)\sum_i y_i/[(x - x_i)\omega'(x_i)]$,

or 

(5.1.12) P(x) m u(x) 

The form (5.1.4) with any of the equivalent representations of the
$L_i(x)$ is the Lagrange interpolation formula.
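
The Lagrange form lends itself to a direct, if not the most economical, evaluation. The following is a minimal sketch in Python and is not part of the original text; the names `lagrange_basis` and `lagrange_interpolate` and the sample data are illustrative assumptions. It forms each $L_i(x)$ as the product (5.1.5) and sums $y_iL_i(x)$ as in (5.1.4).

```python
from math import prod

def lagrange_basis(xs, i, x):
    """L_i(x): product over j != i of (x - x_j) / (x_i - x_j), as in (5.1.5)."""
    return prod((x - xj) / (xs[i] - xj) for j, xj in enumerate(xs) if j != i)

def lagrange_interpolate(xs, ys, x):
    """P(x) = sum of y_i * L_i(x), the Lagrange form (5.1.4)."""
    return sum(y * lagrange_basis(xs, i, x) for i, y in enumerate(ys))

# Illustrative data: three samples of f(x) = x^2 + x + 1.
xs = [0.0, 1.0, 3.0]
ys = [1.0, 3.0, 13.0]

# A quadratic is reproduced by its own interpolation polynomial, so the
# value printed should agree with f(2) = 7 up to rounding.
print(lagrange_interpolate(xs, ys, 2.0))
```

Each evaluation costs on the order of $n^2$ multiplications, which is one reason the alternative schemes of the later sections are of interest when many interpolations are required.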

If $f(x)$ is itself a polynomial of degree not greater than $n$, then $f(x)$
and $P(x)$ are identical. Hence the $L_i(x)$ satisfy the $n + 1$ identities

(5.1.13)   $x^j \equiv \sum_i x_i^jL_i(x) \qquad (j = 0, \ldots, n)$.

This is the identity (5.0.12) for the case $\phi_j = x^j$.

The explicit forms for the $L_i(x)$ become rather complicated when
derivatives of $P$ and $f$ are to be equated at some of the points $x_j$, but in
any particular case they can be formed from the determinantal expression.
However, an important special case arises when both conditions $P = f$
and $P' = f'$ are to be imposed at every $x_i$. For this case we have Hermite's
interpolation formula for the polynomial $H(x)$ of degree $2n + 1$
satisfying

(5.1.14)   $H(x_i) = f(x_i), \qquad H'(x_i) = f'(x_i) \qquad (i = 0, 1, \ldots, n)$.

We can surely express this in the form

(5.1.15)   $H(x) \equiv \sum_i y_ih_i(x) + \sum_i y_i'H_i(x)$

with suitable polynomials $h_i(x)$ and $H_i(x)$, each of degree $2n + 1$ or less.
Instead of writing down the appropriate determinant (5.0.13) and expanding,
it is easier to proceed indirectly. If in (5.1.15) we set $x = x_j$ and
apply (5.1.14), it follows that

$f(x_j) = \sum_i f(x_i)h_i(x_j) + \sum_i f'(x_i)H_i(x_j)$.

This relation must hold whatever may be the values of $f(x_i)$ and of $f'(x_i)$.
Thus, we may have $f(x_j) = 1$ while $f(x_i) = 0$ for every $i \ne j$, and while
$f'(x_i) = 0$ for all $i$. This implies that $h_j(x_j) = 1$. On the other hand,
if, for some particular $k \ne j$, $f(x_k) = 1$ while $f(x_i) = 0$ for $i \ne k$, and
$f'(x_i) = 0$ for all $i$, then we find that $h_k(x_j) = 0$ for $k \ne j$. Setting some
$f'(x_i) = 1$ while all other $f'(x_i)$ and all $f(x_i) = 0$ shows that every
$H_i(x_j) = 0$.

This argument, with an analogous one applied to

(5.1.16)   $H'(x) \equiv \sum_i y_ih_i'(x) + \sum_i y_i'H_i'(x)$,

shows that the polynomials $h_i(x)$ and $H_i(x)$ must satisfy

(5.1.17)   $h_i(x_j) = \delta_{ij}, \qquad H_i(x_j) = 0$,
           $h_i'(x_j) = 0, \qquad H_i'(x_j) = \delta_{ij}$.

We ask now whether with appropriately chosen linear polynomials
$v_i(x)$ and $w_i(x)$ the polynomials $h_i$ and $H_i$ may have the form

(5.1.18)   $h_i(x) = v_i(x)L_i^2(x), \qquad H_i(x) = w_i(x)L_i^2(x)$.

These are, indeed, of the necessary degree $2n + 1$. From (5.1.6) all
conditions are satisfied for $h_i(x_j)$ and $H_i(x_j)$ with $j \ne i$. Before examining
the case for $j = i$, note that, since by (5.1.10)

$\omega(x) = \omega'(x_i)(x - x_i)L_i(x)$,

therefore

$\omega'(x) = \omega'(x_i)[L_i(x) + (x - x_i)L_i'(x)]$,
$\omega''(x) = \omega'(x_i)[2L_i'(x) + (x - x_i)L_i''(x)]$.

Hence $\omega''(x_i) = 2\omega'(x_i)L_i'(x_i)$ or $2L_i'(x_i) = \omega''(x_i)/\omega'(x_i)$. Now

$h_i(x_i) = v_i(x_i)L_i^2(x_i) = v_i(x_i)$

by (5.1.6). Hence, by (5.1.17) $v_i(x_i) = 1$. Also, by differentiating
(5.1.18)

$h_i'(x) = v_i'(x)L_i^2(x) + 2v_i(x)L_i(x)L_i'(x)$,

so that, at $x_i$, $0 = v_i'(x_i) + \omega''(x_i)/\omega'(x_i)$, and since $v_i$ is linear,

(5.1.19)   $v_i(x) = 1 - (x - x_i)\omega''(x_i)/\omega'(x_i)$.

Next $0 = H_i(x_i) = w_i(x_i)L_i^2(x_i)$ so that $w_i(x_i) = 0$. Also

$H_i'(x) = w_i'(x)L_i^2(x) + 2w_i(x)L_i(x)L_i'(x)$,

so that $1 = w_i'(x_i)$. Hence

(5.1.20)   $w_i(x) = x - x_i$.

Hermite's formula is therefore

(5.1.21)   $H(x) = \sum_i [y_iv_i(x) + y_i'w_i(x)]L_i^2(x)$

with $v_i$ and $w_i$ defined by (5.1.19) and (5.1.20).
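
Hermite's formula (5.1.21) can be evaluated in much the same way as the Lagrange form. The sketch below is an illustration added to the text, under the assumption that the values $y_i$ and slopes $y_i'$ are supplied as plain lists; the name `hermite_interpolate` is hypothetical. It uses $\omega''(x_i)/\omega'(x_i) = 2L_i'(x_i)$ together with $L_i'(x_i) = \sum_{j \ne i} 1/(x_i - x_j)$, which follows from logarithmic differentiation of (5.1.5).

```python
from math import prod

def hermite_interpolate(xs, ys, dys, x):
    """Evaluate H(x) = sum of [y_i v_i(x) + y_i' w_i(x)] L_i(x)^2, as in (5.1.21)."""
    total = 0.0
    for i, xi in enumerate(xs):
        others = [xj for j, xj in enumerate(xs) if j != i]
        li = prod((x - xj) / (xi - xj) for xj in others)   # L_i(x)
        dli = sum(1.0 / (xi - xj) for xj in others)        # L_i'(x_i)
        v = 1.0 - 2.0 * (x - xi) * dli                     # (5.1.19)
        w = x - xi                                         # (5.1.20)
        total += (ys[i] * v + dys[i] * w) * li * li
    return total

# Illustrative check with f(x) = x^3 on the two points 0 and 1 (n = 1):
# a cubic is reproduced by the Hermite polynomial of degree 2n + 1 = 3.
value = hermite_interpolate([0.0, 1.0], [0.0, 1.0], [0.0, 3.0], 0.5)
print(value)   # should agree with 0.5**3 up to rounding
```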

Identities analogous to (5.1.13) hold for the $h_i$ and $H_i$. In particular
for $H(x) = 1$ we have

(5.1.22)   $1 \equiv \sum_i v_i(x)L_i^2(x)$.

5.11. The Remainder Term. While the polynomial $P(x)$ is determined
to be equal to $f(x)$ at each of $n + 1$ distinct values of $x$, in general $P \ne f$
at all other values. Some expressions for the remainder

$R(x) = f(x) - \Phi(x)$

were derived in §5.01. A simpler derivation of one of these for the case
of polynomial interpolation will be given now. Practically this provides
only an upper bound for the error in terms of an upper bound for the
derivative of order $n + 1$ of $f(x)$ on an interval which contains all the $x_i$.
Even though we cannot, or do not wish to, evaluate $f$ or any of its derivatives
exactly, we may be able to set limits to the possible values of any
of these derivatives, and in this case the error estimates will be helpful.
Let $x_{n+1}$ be the point at which the error is to be evaluated. Define
the function $g(x)$ by

$g(x) \equiv \begin{vmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^{n+1} & f(x_0) \\ 1 & x_1 & x_1^2 & \cdots & x_1^{n+1} & f(x_1) \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & x_{n+1} & x_{n+1}^2 & \cdots & x_{n+1}^{n+1} & f(x_{n+1}) \\ 1 & x & x^2 & \cdots & x^{n+1} & f(x) \end{vmatrix}$.

Then

$g(x_0) = g(x_1) = \cdots = g(x_{n+1}) = 0$.



By Rolle's theorem, therefore, $g'(x)$ must vanish at least once in each
of the $n + 1$ intervals between consecutive values of the $x_i$. If we let $x_i'$
designate the points at which $g'$ vanishes, then again by Rolle's theorem
$g''(x)$ must vanish at least once in each interval between consecutive
values $x_i'$. Continuing thus, we conclude finally that $g^{(n+1)}(x)$ vanishes
at least once at a point $\xi$ which lies somewhere on the interval between
the greatest and the least of the $x_i$. Hence for this $\xi$ we have

$\begin{vmatrix} 1 & x_0 & \cdots & x_0^{n+1} & f(x_0) \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_{n+1} & \cdots & x_{n+1}^{n+1} & f(x_{n+1}) \\ 0 & 0 & \cdots & (n+1)! & f^{(n+1)}(\xi) \end{vmatrix} = 0$.

This is exact, though we know nothing about $\xi$ except the fact that it lies
somewhere on the interval named. If we expand along the last row, we
get

$f^{(n+1)}(\xi) \begin{vmatrix} 1 & x_0 & \cdots & x_0^{n+1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n+1} & \cdots & x_{n+1}^{n+1} \end{vmatrix} - (n+1)! \begin{vmatrix} 1 & x_0 & \cdots & x_0^n & f(x_0) \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_{n+1} & \cdots & x_{n+1}^n & f(x_{n+1}) \end{vmatrix} = 0$,

and if we solve this equation for $f(x_{n+1})$ and drop the subscript, we get

(5.11.1)   $f(x) = P(x) + \omega(x)f^{(n+1)}(\xi)/(n + 1)!$,

where $\omega$ is the polynomial defined in (5.1.7). Hence the second term on
the right represents the amount by which $P(x)$ deviates from $f(x)$. This
corresponds to (5.01.23).

Now although we would not know the value of $\xi$, which is in any case a
function of $x$, nevertheless we may know an upper bound for $|f^{(n+1)}|$ on the
entire interval. Let us call this $M_{n+1}$. Then for any $x$ on the interval
containing all the $x_i$, we have

(5.11.2)   $|R(x)| \le M_{n+1}|\omega(x)|/(n + 1)!$.

The right member of this inequality vanishes for $x = x_i$, as it should.
Between successive $x_i$, $|\omega(x)|$ rises to a relative maximum. With uniform
spacing of the $x_i$ the maxima are highest near the ends of the interval, for
if $x_0 < x_1 < \cdots < x_n$, then at one end the factor $|x - x_n|$ is large, and
at the other $|x - x_0|$ is large. Hence in this case the approximation is
best for values of $x$ near the middle of the range. For values of $x$ outside
the range, the inequality (5.11.2) is still valid, provided we understand
$M_{n+1}$ to represent a bound for $|f^{(n+1)}|$ in the entire interval including
also $x$. But outside the range, $\omega(x)$ itself becomes increasingly large, and
this accounts for the high uncertainty of extrapolation.
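
The bound (5.11.2) is easily checked numerically. The sketch below is an added illustration, not part of the original text; it assumes $f(x) = \sin x$ on four equally spaced points, for which every derivative is bounded by $M_{n+1} = 1$, and the helper names `omega` and `interpolate` are hypothetical. It compares the actual error with $M_{n+1}|\omega(x)|/(n + 1)!$ at several abscissas, including one slightly outside the range of the fundamental points.

```python
from math import factorial, prod, sin

def omega(xs, x):
    """omega(x): product of (x - x_i), as in (5.1.7)."""
    return prod(x - xi for xi in xs)

def interpolate(xs, ys, x):
    """Lagrange form (5.1.4) of the interpolation polynomial."""
    return sum(y * prod((x - xj) / (xi - xj) for j, xj in enumerate(xs) if j != i)
               for i, (xi, y) in enumerate(zip(xs, ys)))

xs = [0.0, 0.4, 0.8, 1.2]          # assumed fundamental points
ys = [sin(t) for t in xs]
n = len(xs) - 1
M = 1.0                            # bound for |f^(n+1)| = |sin x|

results = []
for k in range(14):                # x = 0.0, 0.1, ..., 1.3
    x = 0.1 * k
    err = abs(sin(x) - interpolate(xs, ys, x))
    bound = M * abs(omega(xs, x)) / factorial(n + 1)
    results.append((x, err, bound))
    print(f"x = {x:3.1f}   error = {err:.2e}   bound = {bound:.2e}")
```

Both columns vanish (to rounding) at the fundamental points, and both grow quickly once $x$ passes the last of them.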

5.12. Chebyshev Polynomials and Optimum-interval Interpolation. In
the inequality (5.11.2) the factor $M_{n+1}$ depends upon the particular function
$f(x)$ but not upon the distribution within the interval of interpolation
of the values $x_i$ which determine the $P(x)$. On the other hand, $\omega(x)$
depends only upon the distribution of the $x_i$ and not at all upon the
function. Ordinarily, in short calculations one has available a set of
tabulated values of $f(x)$ and must accept them as they are given. But
when a table is being prepared, the location of the $x_i$ can be chosen at
will, and some choices might be better than others.

All that we know about the variation with $x$ of the error in the interpolation
is contained in the polynomial $\omega(x)$. The bounds of the error are
least exact at those points $x$ where $|\omega(x)|$ is greatest. Hence it is natural
to prefer a selection of points $x_i$ on the interval which reduces as far as
possible the greatest maximum of $|\omega(x)|$. It is plausible to suppose that,
since the relative maxima ordinarily vary in height from one to the next,
a choice of the $x_i$ that reduces the highest maximum will probably raise
some of the others. Hence we might anticipate that the minimal maximum
will be had, if at all, only in a case where all maxima are equal.
And a succession of equal maxima suggests a trigonometric sine or
cosine.

Introduce a change of scale and origin so that the interval over which
the interpolations are to be made is the interval from $-1$ to $+1$. By a
well-known trigonometric identity, $\cos n\theta$ is expressible as a polynomial
in $\cos\theta$ of degree $n$. This is trivial for $n = 0$ and $n = 1$, while for $n = 2$,
$\cos 2\theta = 2\cos^2\theta - 1$. Suppose it verified that

(5.12.1)   $\cos r\theta = P_r(\cos\theta)$

is a polynomial of degree $r$ for $r = 0, 1, \ldots, n$. Then

$\cos(n + 1)\theta + \cos(n - 1)\theta = 2\cos n\theta\cos\theta$

by a formula from elementary trigonometry, whence

(5.12.2)   $P_{n+1} = 2P_nP_1 - P_{n-1}$

is a polynomial of degree $n + 1$ in $\cos\theta$. Hence if

$x = \cos\theta, \qquad \theta = \cos^{-1}x, \qquad 0 \le \theta \le \pi$,

then $P_n(x)$ is a polynomial in $x$ of degree $n$, and since $P_0 = 1$, $P_1 = x$, it
follows from (5.12.2) that the coefficient of $x^n$ in $P_n$ is $2^{n-1}$. Hence each
polynomial

(5.12.3)   $T_0 = 1, \qquad T_n = 2^{1-n}\cos(n\cos^{-1}x) \qquad (n \ge 1)$,

has leading coefficient 1, and since all its zeros lie on the interval, it is a
possible $\omega(x)$. We can prove that no polynomial $R_n(x)$ exists with degree
$n$ and leading coefficient 1 whose maximum absolute values are all
numerically less than those of $T_n(x)$.

Since $T_n(x) = 2^{1-n}\cos n\theta$, therefore $T_n(x)$ has the relative maxima and
minima of $\pm 2^{1-n}$ for

$n\theta = j\pi \qquad (j = 0, 1, \ldots, n)$,

and hence for

$x_j' = \cos(j\pi/n)$.

Now if $R_n$ is of degree $n$ with leading coefficient 1, and has no maximum
or minimum numerically greater than those of $T_n$, then

$T_n(x_0') - R_n(x_0') \ge 0, \qquad T_n(x_1') - R_n(x_1') \le 0, \qquad \ldots,$

since the maximum of $T_n$ at $x_0'$ cannot be less than the value of $R_n$ at $x_0'$,
the minimum of $T_n$ at $x_1'$ cannot exceed the value of $R_n$ at $x_1'$, . . . .
Hence the polynomial $T_n - R_n$ must vanish at least $n$ times on the
interval. But $T_n - R_n$ is of degree only $n - 1$, and therefore

$T_n - R_n \equiv 0$.

The application of this theorem is that, if one chooses the $n + 1$ values
$x_i$ to be the zeros of the polynomial $T_{n+1}$, making this the $\omega$ of (5.11.2),
then the greatest possible error of interpolation anywhere on the interval
is $2^{-n}M_{n+1}/(n + 1)!$ for any function whatever whose derivative of
order $n + 1$ does not exceed $M_{n+1}$ on the interval. Any other choice of
fundamental points would replace the factor $2^{-n}$ by a larger one.

The polynomials defined by (5.12.3) are known as the Chebyshev
polynomials. Their zeros are easily found, for they vanish when

$n\theta = (2i + 1)\pi/2 \qquad (i = 0, 1, \ldots, n - 1)$,

and hence when

(5.12.4)   $x_i = \cos[(2i + 1)\pi/(2n)]$.

In these formulas it is understood that the range of interpolation has been
transformed to the interval from $-1$ to $+1$.
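
The minimal property can be observed numerically. The following sketch is an added illustration, not taken from the original; the helper names and the sampling grid are assumptions. It forms $\omega(x)$ for five Chebyshev points from (5.12.4) and for five equally spaced points on the interval from $-1$ to $+1$, and estimates the greatest value of $|\omega(x)|$ on a fine grid in each case.

```python
from math import cos, pi, prod

def chebyshev_zeros(n):
    """Zeros of T_n on (-1, 1), from (5.12.4): x_i = cos((2i + 1) pi / (2n))."""
    return [cos((2 * i + 1) * pi / (2 * n)) for i in range(n)]

def max_abs_omega(points, samples=2001):
    """Largest |omega(x)| over an equally spaced grid on [-1, 1] (an estimate)."""
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(prod(x - xi for xi in points)) for x in grid)

n = 5                                              # number of fundamental points
cheb = chebyshev_zeros(n)
unif = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

print(max_abs_omega(cheb), 2.0 ** (1 - n))         # both close to 2^(1-n)
print(max_abs_omega(unif))                         # larger for uniform spacing
```

With $n$ points the polynomial $\omega$ is $T_n$, whose extreme values are $\pm 2^{1-n}$; in the notation of the text, $n + 1$ points give the factor $2^{-n}$.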

Function tables may contain hundreds or thousands of entries, and
for any particular interpolation one would expect to use only a few consecutive
ones. When the table is to be printed in a book, ordinarily the
abscissas $x_i$ are uniformly spaced. When tabular entries are required
for automatic computation, it is important to reduce to a minimum the
number of entries to be recorded. The use of the Chebyshev points
$x_i$ may then be appropriate.

Suppose that we are willing to use an interpolation polynomial of
degree $n$ at most and that an error of magnitude $\epsilon$ can be tolerated. The
entire range of the variable is to be broken up into subintervals within
each of which at most $n + 1$ Chebyshev points $x_i$ are to be selected at
which to evaluate the entries $f(x_i)$ for the tabulation. The entries $f(x_i)$
on one of these subintervals will be used to determine the interpolation
polynomial for that interval. We would like to use as few of these subintervals
as possible, and hence we would like to make each subinterval
as long as it can be made without allowing the interpolation error to
exceed $\epsilon$, or the degree of the polynomial to exceed $n$. In some circumstances
an optimal solution is possible.

We consider here the problem of making a particular interval as
long as possible. When the end points $a$ and $b$ are known, then the
transformation

(5.12.5)   $x = \frac{1}{2}(b - a)u + \frac{1}{2}(b + a), \qquad u = (2x - b - a)/(b - a)$,

transforms the interval $(a, b)$ in the variable $x$ to the interval $(-1, 1)$
in the variable $u$. If

$\phi(u) = \prod_i (u - u_i)$,

where $u_i$ is given by substituting $x_i$ in (5.12.5), then

$\omega(x) = [(b - a)/2]^{n+1}\phi(u)$,

as is verified directly.

In view of (5.11.1), the condition that the error shall nowhere exceed
$\epsilon$ is that

$|f^{(n+1)}(\xi)\,\omega(x)| \le (n + 1)!\,\epsilon$.

Suppose $|f^{(n+1)}(x)|$ is monotonically decreasing. Then this inequality is
surely satisfied if

$|f^{(n+1)}(a)|\,M \le (n + 1)!\,\epsilon$,

where $M$ represents the maximum of $|\omega(x)|$ on the interval $(a, b)$. If
$N$ is the maximum of $|\phi(u)|$ on the interval $(-1, 1)$, this is equivalent to

$(b - a)^{n+1}N\,|f^{(n+1)}(a)| \le 2^{n+1}(n + 1)!\,\epsilon$.

Hence for a fixed $a$ the longest admissible interval $b - a$ would be that
for which the equality holds:

$(b - a)^{n+1} = 2^{n+1}(n + 1)!\,\epsilon\,N^{-1}|f^{(n+1)}(a)|^{-1}$.

If the Chebyshev points are used, $\phi(u) = T_{n+1}(u)$, then $N = 2^{-n}$, and
therefore

(5.12.6)   $b - a = 4[(n + 1)!\,2^{-1}\epsilon\,|f^{(n+1)}(a)|^{-1}]^{1/(n+1)}$.

When $|f^{(n+1)}(x)|$ is monotonically increasing, $a$ and $b$ can be interchanged.

If $|f^{(n+1)}(x)|$ remains monotonic over the entire range of the tabulation,
the range can be divided into optimal intervals by starting at one end
and working toward the other; if it has a single maximum, one can start
with this and work toward the two ends. Other cases will require
special treatment.
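
The interval length (5.12.6) can be tried out on a simple case. The sketch below is an added illustration under stated assumptions: the function is taken to be $f(x) = e^{-x}$, whose derivatives decrease monotonically in magnitude, and the names `optimal_length`, `chebyshev_points`, and `interpolate` are hypothetical. It computes $b - a$ for a cubic ($n = 3$) and a tolerance $\epsilon = 10^{-6}$, places the Chebyshev points on $(a, b)$ by (5.12.5), and samples the actual error.

```python
from math import cos, exp, factorial, pi, prod

def optimal_length(n, eps, deriv_at_a):
    """Interval length b - a from (5.12.6), given |f^(n+1)(a)|."""
    return 4.0 * (factorial(n + 1) * eps / (2.0 * abs(deriv_at_a))) ** (1.0 / (n + 1))

def chebyshev_points(a, b, n):
    """The n + 1 zeros of T_(n+1), carried from (-1, 1) to (a, b) by (5.12.5)."""
    us = [cos((2 * i + 1) * pi / (2 * (n + 1))) for i in range(n + 1)]
    return [0.5 * (b - a) * u + 0.5 * (b + a) for u in us]

def interpolate(xs, ys, x):
    """Lagrange form (5.1.4) of the interpolation polynomial."""
    return sum(y * prod((x - xj) / (xi - xj) for j, xj in enumerate(xs) if j != i)
               for i, (xi, y) in enumerate(zip(xs, ys)))

n, eps, a = 3, 1e-6, 0.0
b = a + optimal_length(n, eps, exp(-a))      # |f''''(a)| = exp(-a)
xs = chebyshev_points(a, b, n)
ys = [exp(-t) for t in xs]
worst = max(abs(exp(-t) - interpolate(xs, ys, t))
            for t in (a + (b - a) * k / 400 for k in range(401)))
print(b - a, worst)                          # worst should not exceed eps
```

Repeating the computation with $a$ replaced by the $b$ just found gives the next subinterval, which is the end-to-end procedure described above for a monotonic derivative.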

5.13. Aitken's Method of Interpolation. We turn now to computational
procedures. Calculation of the $L_i(x)$ directly from (5.1.5) or (5.1.10)
involves considerable labor if the degree is higher than one or two.
Tables of the $L_i$ are available for equally spaced $x_i$. If these are not at
hand, or if the $x_i$ are not equally spaced, then Aitken's method of computation
is almost ideally simple.

We first obtain a generalization of the formula (5.1.3). Let $P_i$ stand
for $P(x_i)$; let $P_{ij}$ stand for the linear interpolation polynomial determined
by $(x_i, y_i)$ and $(x_j, y_j)$; let $P_{ijk}$ stand for the quadratic interpolation
polynomial determined by $(x_i, y_i)$, $(x_j, y_j)$, and $(x_k, y_k)$; . . . ; and let
$P_{01 \ldots n}$ stand for $P$ itself. As for $P_i$, we can regard it as the interpolation
polynomial of zero degree determined by $(x_i, y_i)$.

Note that

$P_{ij} = P_{ji}$

and that in general for any $P_{ij \ldots}$ permuting the subscripts leaves the
polynomial unchanged. Note also that

(5.13.1)   $P_{ij \ldots}(x_i) = y_i$.

The generalization of (5.1.3) is that for any $m$, if $M$ stands for the set of
subscripts $m + 1, m + 2, \ldots, n$, then

(5.13.2)   $\begin{vmatrix} 1 & x_0 & \cdots & x_0^m & P_{0M} \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_m & \cdots & x_m^m & P_{mM} \\ 1 & x & \cdots & x^m & P \end{vmatrix} = 0$.

If in place of the $P_{iM}$ we were to write $P_i$, this equation would define the
interpolation polynomial of degree $m$ determined by $(x_0, y_0), \ldots, (x_m, y_m)$.

In proof we observe first that the polynomial $P$ defined by (5.13.2)
is of degree $n$ at most, since in the expansion of the determinant $x^m$ will
multiply each of the interpolation polynomials $P_{iM}$, which are of degree
$n - m$, and there are no terms of higher degree. Next we observe that,
if in the determinant we set $x = x_i$ for $i \le m$, then $P$ must take the
value assumed by $P_{iM}$, and this by (5.13.1) is $y_i$. This is true because
all other elements of the last row are then identical with corresponding
elements of the row $i + 1$. Finally, if in the determinant we set $x = x_j$
for $j > m$, then every $P_{iM}$ becomes equal to $y_j$, making all elements but
the last in the last column equal to $y_j$ times the corresponding elements
in the first column. Hence the determinant can vanish only if also $P$
has the value $y_j$. Hence the $P$ defined by (5.13.2) is in fact the polynomial
$P$ of (5.1.3), and the theorem is proved.

Aitken applies this principle in the following way: In application of
(5.1.3) with $n = 1$, we have

$\begin{vmatrix} 1 & x_0 & P_0 \\ 1 & x_1 & P_1 \\ 1 & x & P_{01} \end{vmatrix} = 0$,

and hence

(5.13.3)   $P_{01} = \begin{vmatrix} x - x_0 & P_0 \\ x - x_1 & P_1 \end{vmatrix} / (x_1 - x_0)$.

Likewise

(5.13.4)   $P_{12} = \begin{vmatrix} x - x_1 & P_1 \\ x - x_2 & P_2 \end{vmatrix} / (x_2 - x_1)$.

From the theorem then we can say that

(5.13.5)   $P_{012} = \begin{vmatrix} x - x_0 & P_{01} \\ x - x_2 & P_{12} \end{vmatrix} / (x_2 - x_0)$.

In like manner we can form

(5.13.6)   $P_{0123} = \begin{vmatrix} x - x_0 & P_{012} \\ x - x_3 & P_{123} \end{vmatrix} / (x_3 - x_0)$.

In a specific calculation $x$ is a specific number, and the sequential computations
yield the numerical values assumed by the various polynomials
at that particular $x$. When different polynomials agree to a sufficient
number of significant figures, one terminates the process.

In applying these formulas, one can choose at will the particular polynomials
$P_{ij \ldots}$ to be evaluated, bearing in mind only that, when a pair
$P_{ij \ldots}$ and $P_{kj \ldots}$ are used to evaluate $P_{ikj \ldots}$, they must agree in all
subscripts but one, while the unlike subscripts $i$ and $k$ determine the
$x_i$ and $x_k$ which are to appear explicitly. Aitken proposes a sequence and
tabulation as follows:

$P_0$                                                          $x_0 - x$
$P_1$    $P_{01}$                                              $x_1 - x$
$P_2$    $P_{02}$    $P_{012}$                                 $x_2 - x$
$P_3$    $P_{03}$    $P_{013}$    $P_{0123}$                   $x_3 - x$
$P_4$    $P_{04}$    $P_{014}$    $P_{0124}$    $P_{01234}$    $x_4 - x$

However, the best approximation can be expected at any stage when the
abscissa $x$ lies roughly in the middle of the interval containing the particular
fundamental abscissas being utilized. Consequently it is advantageous
in using this scheme to order the abscissas so that either
$\cdots < x_4 < x_2 < x_0 < x < x_1 < x_3 < \cdots$, or else the reverse order holds.

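Aitken's tabulation can be carried out in place on a single column of numbers. The sketch below is an added illustration and not Aitken's own layout; the name `aitken` and the sample data are assumptions. At stage $j$ the entry in row $j - 1$, which by then holds $P_{01 \ldots j-1}$, is combined with each later row $k$ by the cross-multiplication of (5.13.3), so that on return row $j$ holds $P_{01 \ldots j}$ evaluated at the given $x$.

```python
def aitken(xs, ys, x):
    """Return [P_0, P_01, P_012, ...] evaluated at x by Aitken's scheme."""
    p = list(ys)
    n = len(xs)
    for j in range(1, n):
        pivot_x, pivot_p = xs[j - 1], p[j - 1]      # row holding P_{0...j-1}
        for k in range(j, n):
            # Cross-multiplication as in (5.13.3), with the factors written
            # as x_i - x to match the last column of the tabulation.
            p[k] = (pivot_p * (xs[k] - x) - p[k] * (pivot_x - x)) / (xs[k] - pivot_x)
    return p

# Illustrative data: samples of f(x) = x^2 + x + 1, interpolated at x = 2.
xs = [0.0, 1.0, 3.0]
ys = [1.0, 3.0, 13.0]
print(aitken(xs, ys, 2.0))   # successive entries P_0, P_01, P_012
```

In a hand computation one would watch the successive entries of the returned list and stop as soon as two of them agree to the required number of figures.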
5.14. Divided Differences. Aitken's method is disadvantageous when
a number of interpolations must be carried out over the same range.
An alternative to the computation of the Lagrange polynomials $L_i(x)$ is
the use of divided differences in the construction of Newton's interpolation
formula.

The polynomial

(5.14.1)   $P(x) = a_0 + (x - x_0)a_1 + (x - x_0)(x - x_1)a_2 + \cdots + (x - x_0)(x - x_1) \cdots (x - x_{n-1})a_n$

is of degree $n$ and assumes the values $f(x_i)$ at $x_i$, provided

(5.14.2)   $f(x_i) = a_0 + (x_i - x_0)a_1 + \cdots + (x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})a_i$,

and the coefficients $a_i$ can be determined recursively from these relations.
From these relations it is apparent that for any function $f(x)$ the polynomial
$P_{01 \ldots m}(x)$ as defined in the last section is

$P_{012 \ldots m}(x) = a_0 + (x - x_0)a_1 + \cdots + (x - x_0) \cdots (x - x_{m-1})a_m$,

where the coefficients are the same as the first $m + 1$ coefficients in $P(x)$.
Finally, since $a_n$ is the coefficient of $x^n$ in (5.14.1), this must be equal
to the coefficient $c_n$ of $x^n$ in (5.1.1). Hence $a_n = c_n$ is expressible as the
quotient of two determinants, where the denominator is the Vandermonde
of order $n + 1$, and the numerator is the same except that the elements
$x_i^n$ are replaced by $f(x_i)$. Hence $a_m$ is expressible as the quotient of two
similar determinants of order $m + 1$. Thus, for the given function $f$
each coefficient $a_m$ in (5.14.1) is a function of the $m + 1$ variables $x_0$,
$x_1, \ldots, x_m$. This function is called a divided difference of order $m$
and is written

(5.14.3)   $f(x_0, x_1, \ldots, x_m) = \begin{vmatrix} 1 & x_0 & \cdots & x_0^{m-1} & f(x_0) \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_m & \cdots & x_m^{m-1} & f(x_m) \end{vmatrix} \Big/ \begin{vmatrix} 1 & x_0 & \cdots & x_0^{m-1} & x_0^m \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & x_m & \cdots & x_m^{m-1} & x_m^m \end{vmatrix}$.

For the particular case when $f = x^r$ for $r < m$ this vanishes for any set of
fundamental points, while for $r = m$

$f(x_0, x_1, \ldots, x_m) = 1$.

Hence the divided difference of order $m$ for any polynomial of degree $m$
is a constant, and for any polynomial of degree less than $m$ it vanishes.

The notation $[x_0, x_1, \ldots, x_m]$ is often found in the literature in place
of $f(x_0, x_1, \ldots, x_m)$, but this fails to place in evidence the function
whose divided difference is being written.

Now consider the expansions

$P_{01} = f(x_0) + (x - x_0)f(x_0, x_1)$,
$P_{02} = f(x_0) + (x - x_0)f(x_0, x_2)$,
$P_{012} = f(x_0) + (x - x_0)f(x_0, x_1) + (x - x_0)(x - x_1)f(x_0, x_1, x_2)$.

By a formula analogous to (5.13.5), however, it is also true that

$P_{012} = \begin{vmatrix} x - x_1 & P_{01} \\ x - x_2 & P_{02} \end{vmatrix} / (x_2 - x_1)$.

By equating the coefficients of $x^2$ in the two expressions for $P_{012}(x)$, one
obtains the identity

$f(x_0, x_1, x_2) = [f(x_0, x_2) - f(x_0, x_1)]/(x_2 - x_1)$.

Since the divided difference is symmetric in all variables, it follows also
that

$f(x_0, x_1, x_2) = [f(x_1, x_2) - f(x_0, x_1)]/(x_2 - x_0)$
$\qquad\qquad\quad\; = [f(x_0, x_2) - f(x_1, x_2)]/(x_0 - x_1)$.

By (5.13.6) one finds likewise that

$f(x_0, x_1, x_2, x_3) = [f(x_1, x_2, x_3) - f(x_0, x_1, x_2)]/(x_3 - x_0)$,

and in general

(5.14.4)   $f(x_0, x_1, x_2, \ldots) = [f(x_0, x_2, \ldots) - f(x_1, x_2, \ldots)]/(x_0 - x_1)$,

where the omitted variables are the same in all three places. This being
the case, one can form divided differences of progressively higher order
according to the scheme:

(5.14.5)
$f(x_0)$
             $f(x_0, x_1)$
$f(x_1)$                      $f(x_0, x_1, x_2)$
             $f(x_1, x_2)$                           $f(x_0, x_1, x_2, x_3)$
$f(x_2)$                      $f(x_1, x_2, x_3)$
             $f(x_2, x_3)$
$f(x_3)$

where each $f$ is equal to the difference of the two on its left, divided by
the difference of the $x$'s on the diagonals with it.
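
The scheme (5.14.5) needs only a single working column if the differences are overwritten from the bottom up. The following sketch is an added illustration; the names `divided_differences` and `newton_eval` and the sample data are assumptions. The first function returns the top diagonal $f(x_0), f(x_0, x_1), f(x_0, x_1, x_2), \ldots$, which are the coefficients $a_m$ of (5.14.1), and the second evaluates (5.14.1) by nested multiplication.

```python
def divided_differences(xs, ys):
    """Return [f(x_0), f(x_0,x_1), ..., f(x_0,...,x_n)] by the scheme (5.14.5)."""
    coef = list(ys)
    n = len(xs)
    for order in range(1, n):
        # Work upward so that entries of the previous order are still
        # available when they are needed.
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form (5.14.1) by nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

# Illustrative data: samples of f(x) = x^2 + x + 1.
xs = [0.0, 1.0, 3.0]
ys = [1.0, 3.0, 13.0]
coef = divided_differences(xs, ys)
print(coef)                        # coefficients a_0, a_1, a_2
print(newton_eval(xs, coef, 2.0))  # value of the interpolation polynomial at 2
```

Once the coefficients have been formed they serve for any number of interpolations over the same range, which is the advantage over Aitken's method noted at the head of the section.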

Another expression for the divided difference which can be obtained
directly from (5.14.3) is of some theoretical interest. The coefficient of
$f(x_i)$ on the right of this identity is the quotient of two Vandermonde
determinants. When the common factors are canceled out, we are left
with

(5.14.6)   $f(x_0, x_1, \ldots, x_m) = \sum_{i=0}^m f(x_i) \Big/ \prod_{j \ne i}(x_i - x_j)$.

This brings out again the fact that the divided difference of any order is
symmetric in all the arguments $x_i$ which appear in it.

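5.141. Integral form of the remainder. Consider the function

$\phi(t) = f[(1 - t)x_0 + tx_1]$

for fixed $x_0$ and $x_1$. This satisfies

$\phi'(t) = (x_1 - x_0)f'[(1 - t)x_0 + tx_1]$,
$\phi(0) = f(x_0), \qquad \phi(1) = f(x_1)$.

Hence integration of $\phi'(t)$ and division by $(x_1 - x_0)$ gives

(5.141.1)  $f(x_0, x_1) = \int_0^1 f'[(1 - t)x_0 + tx_1]\,dt$.

Now consider

$\phi(t_2) = f'[(1 - t_1)x_0 + (t_1 - t_2)x_1 + t_2x_2]$,
$\phi'(t_2) = (x_2 - x_1)f''[(1 - t_1)x_0 + (t_1 - t_2)x_1 + t_2x_2]$.

Then

$\int_0^{t_1} f''[(1 - t_1)x_0 + (t_1 - t_2)x_1 + t_2x_2]\,dt_2 = (x_2 - x_1)^{-1}[\phi(t_1) - \phi(0)]$
$\qquad = (x_2 - x_1)^{-1}\{f'[(1 - t_1)x_0 + t_1x_2] - f'[(1 - t_1)x_0 + t_1x_1]\}$.

But if we now apply the original result (5.141.1), we find

$\int_0^1 dt_1 \int_0^{t_1} f''[(1 - t_1)x_0 + (t_1 - t_2)x_1 + t_2x_2]\,dt_2 = (x_2 - x_1)^{-1}[f(x_0, x_2) - f(x_0, x_1)]$,

whereas the right member of this is by (5.14.4) equal to $f(x_0, x_1, x_2)$.
Hence

(5.141.2)  $f(x_0, x_1, x_2) = \int_0^1 dt_1 \int_0^{t_1} f''[(1 - t_1)x_0 + (t_1 - t_2)x_1 + t_2x_2]\,dt_2$,

and by a simple induction we have in general

(5.141.3)  $f(x_0, x_1, \ldots, x_n) = \int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-1}} f^{(n)}[(1 - t_1)x_0 + (t_1 - t_2)x_1 + \cdots + t_nx_n]\,dt_n$.

In the special case when $x_0 = x_1$, the relation (5.141.1) gives

$f(x_0, x_0) = f'(x_0)$.

Again when $x_0 = x_1 = x_2$, it follows from (5.141.2) that

$f(x_0, x_0, x_0) = f''(x_0)/2!$,

and generally for $m + 1$ equal arguments

(5.141.4)  $f(x_0, x_0, \ldots, x_0) = f^{(m)}(x_0)/m!$.

Hence if the function and its derivatives up to and including the $m$th are
known at some point $x_0$, and the derivatives of the interpolation polynomial
are required to equal those of the function, the table of divided
differences is formed by writing $f(x_0)$ $m + 1$ times, $f'(x_0)$ $m$ times, $f''(x_0)/2!$
$m - 1$ times, . . . , and constructing the rest of the table as before.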
In general, whether or not there are repeated arguments, the identity

$f(x_0, x) = [f(x) - f(x_0)]/(x - x_0)$

is valid for any $x$, including in the limit $x = x_0$, and it can be written

(5.141.5)  $f(x) = f(x_0) + (x - x_0)f(x_0, x)$.

Again

$f(x_0, x_1, x) = [f(x_0, x) - f(x_0, x_1)]/(x - x_1)$,

and the identity can be written

$f(x_0, x) = f(x_0, x_1) + (x - x_1)f(x_0, x_1, x)$.

When this is substituted into (5.141.5), the result is

(5.141.6)  $f(x) = f(x_0) + (x - x_0)f(x_0, x_1) + (x - x_0)(x - x_1)f(x_0, x_1, x)$.

In general,

(5.141.7)  $f(x) = f(x_0) + (x - x_0)f(x_0, x_1) + \cdots$
           $\qquad + (x - x_0)(x - x_1) \cdots (x - x_{n-1})f(x_0, x_1, \ldots, x_n)$
           $\qquad + (x - x_0)(x - x_1) \cdots (x - x_n)f(x_0, x_1, \ldots, x_n, x)$.

If $f$ were a polynomial of degree $n$ or less, the last divided difference would
vanish identically. For an arbitrary $f$, the last term represents the error
made in replacing $f$ by its interpolation polynomial of degree $n$ as given
by the preceding terms. On introducing (5.141.3), we can write

$f(x) = P(x) + R(x)$,

(5.141.8)  $R(x) = \omega(x)\int_0^1 dt_0 \int_0^{t_0} dt_1 \cdots \int_0^{t_{n-1}} f^{(n+1)}\left[x + t_0(x_0 - x) + \sum_{i=1}^n t_i(x_i - x_{i-1})\right] dt_n$,

where $P(x)$ is the interpolation polynomial of degree $n$, and $R(x)$ is the
remainder. The expression previously obtained for the remainder $R(x)$
involved the indeterminate quantity $\xi$ known only to lie somewhere on the
interval containing $x_0, x_1, \ldots, x_n$ and $x$. Note that in case all the
fundamental points $x_0, \ldots, x_n$ coincide the expression (5.141.7)
becomes a Taylor expansion, and (5.141.8) is a well-known formula for
the remainder.

5.15. Operational Derivation of Equal-interval Formulas. Naturally
the Lagrangean formulas for interpolation can be specialized to the case
of equally spaced abscissas $x_i$. However, a great many special forms
exist, and most of these are readily derived directly by the use of an
operational scheme.

Suppose that the entire tabulation is made at points $\ldots, x_{-2}, x_{-1},
x_0, x_1, x_2, \ldots$ and that

$x_{i+1} = x_i + h$

for every integer $i$. Then

$x_i = x_0 + ih$

for every $i$. The interval width $h$ is assumed fixed throughout. On
making the change of variable

(5.15.1)   $x = x_0 + uh, \qquad u = (x - x_0)/h$,

the function $f(x)$ becomes a function of $u$:

(5.15.2)   $f(x_0 + uh) = g(u)$,

for which the interval width is unity.

We now define a set of operators with respect to the interval $h$. For
any function $f(x)$, the displacement operator $E$, the forward-difference
operator $\Delta$, the backward-difference operator $\nabla$, and the central-difference
operator $\delta$ are defined as follows:

(5.15.3)   $Ef(x) = f(x + h)$,
           $\Delta f(x) = f(x + h) - f(x)$,
           $\nabla f(x) = f(x) - f(x - h)$,
           $\delta f(x) = f(x + h/2) - f(x - h/2)$.

Since it is natural to require that

$E^2f(x) = E[Ef(x)] = Ef(x + h)$

and in general

(5.15.4)   $E^uf(x) = f(x + uh)$

for any integer $u$, we may indeed go a step further and accept (5.15.4)
for all real $u$, integral or not. With this understanding we can write the
following formal relations between pairs of operators:

(5.15.5)   $\Delta = E - 1, \qquad \nabla = 1 - E^{-1}, \qquad \delta = E^{1/2} - E^{-1/2}$.

In principle any of the four operators can be expressed in terms of any
other, but the expressions are not all simple. Thus $\Delta$ is the "negative"
root of the quadratic

$\Delta^2 - \delta^2\Delta - \delta^2 = 0$,

and $\nabla$ is the "positive" root of

$\nabla^2 + \delta^2\nabla - \delta^2 = 0$.

In addition to the operators, we define also the factorials

(5.15.6)   $u^{(r)} = u(u - 1) \cdots (u - r + 1) = (x - x_0)(x - x_1) \cdots (x - x_{r-1})/h^r$

and the generalized binomial coefficient

(5.15.7)   $u_{(r)} = u^{(r)}/r!$.
Since $\Delta u(x) = 1$, we find for these quantities

(5.15.8)   $\Delta u^{(r)} = ru^{(r-1)}$.

Now if $u$ is a positive integer, since $E = 1 + \Delta$, it follows that

(5.15.9)   $E^u = 1 + u_{(1)}\Delta + u_{(2)}\Delta^2 + \cdots$,

and the series terminates after $u + 1$ terms. If these equivalent operators
are applied to $f_0 = f(x_0)$, we have

(5.15.10)  $f(x) = f_0 + u_{(1)}\Delta f_0 + u_{(2)}\Delta^2f_0 + u_{(3)}\Delta^3f_0 + \cdots$,

where $x$ is given by (5.15.1). If $u$ is not a positive integer, the series
does not terminate. But the polynomial in $x$ which results from replacing
$u$ by its expression in terms of $x$ on the right of

(5.15.11)  $P(x) = f_0 + u_{(1)}\Delta f_0 + \cdots + u_{(n)}\Delta^nf_0$

is a polynomial of degree $n$ in $x$, and for $i = 0, 1, \ldots, n$ it is true that
$P(x_i) = f(x_i)$. Hence $P(x)$ defined by (5.15.11) is the interpolation polynomial
determined at the points $x_0, \ldots, x_n$. This is Newton's formula
for forward interpolation. In terms of $x$ it is

(5.15.12)  $P(x) = f_0 + \frac{x - x_0}{h}\Delta f_0 + \frac{(x - x_0)(x - x_1)}{2!\,h^2}\Delta^2f_0 + \cdots + \frac{(x - x_0)(x - x_1) \cdots (x - x_{n-1})}{n!\,h^n}\Delta^nf_0$.

Note that the fundamental points which determine this polynomial are
the points whose abscissas are $x_0, x_1, x_2, \ldots, x_n$.
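
Newton's forward formula (5.15.11) translates directly into a short routine. The sketch below is an added illustration, with the names `forward_differences` and `newton_forward` and the sample table chosen for the purpose. The first function forms $f_0, \Delta f_0, \Delta^2f_0, \ldots$ from equally spaced entries, and the second accumulates the terms $u_{(r)}\Delta^rf_0$, building each binomial coefficient from the preceding one by $u_{(r+1)} = u_{(r)}(u - r)/(r + 1)$.

```python
def forward_differences(fs):
    """Return [f_0, Δf_0, Δ²f_0, ...] for equally spaced entries fs."""
    diffs = [fs[0]]
    row = list(fs)
    while len(row) > 1:
        row = [later - earlier for earlier, later in zip(row, row[1:])]
        diffs.append(row[0])
    return diffs

def newton_forward(x0, h, fs, x):
    """Evaluate (5.15.11) at x, with u = (x - x0) / h as in (5.15.1)."""
    u = (x - x0) / h
    total, binom = 0.0, 1.0               # binom holds u_(r), starting with u_(0) = 1
    for r, d in enumerate(forward_differences(fs)):
        total += binom * d
        binom *= (u - r) / (r + 1)        # u_(r+1) = u_(r) (u - r) / (r + 1)
    return total

# Illustrative table: f(x) = x^2 + x + 1 at x = 0, 0.5, 1.
fs = [1.0, 1.75, 3.0]
print(forward_differences(fs))              # f_0, Δf_0, Δ²f_0
print(newton_forward(0.0, 0.5, fs, 0.25))   # compare with f(0.25) = 1.3125
```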

Newton's formula for backward interpolation is obtained by expanding 
$E^u$ in powers of $\nabla$: 

(5.15.13) $E^u = (1 - \nabla)^{-u} = 1 + u_{(1)}\nabla + (u + 1)_{(2)}\nabla^2 + (u + 2)_{(3)}\nabla^3 + \cdots.$ 
On dropping terms of degree higher than $n$, one has 

(5.15.14) $P(x) = f_0 + u_{(1)}\nabla f_0 + (u + 1)_{(2)}\nabla^2 f_0 + \cdots + (u + n - 1)_{(n)}\nabla^n f_0$ 
$= f_0 + \frac{x - x_0}{h}\nabla f_0 + \frac{(x - x_0)(x - x_{-1})}{2!\,h^2}\nabla^2 f_0 + \cdots + \frac{(x - x_0)(x - x_{-1}) \cdots (x - x_{-n+1})}{n!\,h^n}\nabla^n f_0.$ 

For this the fundamental points are the points whose abscissas are $x_{-n}$, 
$x_{-n+1}, \ldots, x_{-1}, x_0$. If these were the same as the points entering in 
(5.15.12), but with different designations, then the two polynomials 
(5.15.12) and (5.15.14) would be identical except in form. They would 
also be identical if $f$ were itself a polynomial of degree $n$, whether or not 
the points were the same. In that case $\Delta^{n+1}f$ and $\nabla^{n+1}f$ would vanish, 
and both polynomials $P(x)$ would be identical with $f$. 
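
The backward formula can be sketched in the same way. The Python fragment below is an added illustration under stated assumptions, not the author's routine: the name `newton_backward` is hypothetical, and the table is assumed to be given in the order f_{-n}, ..., f_{-1}, f_0 at unit steps of u. Successive backward differences at x_0 are read from the last entry of each differenced row.

```python
def newton_backward(values, u):
    """values = [f_{-n}, ..., f_{-1}, f_0]; evaluate (5.15.14) at x = x_0 + u h."""
    row, total, coeff, r = list(values), 0.0, 1.0, 0
    while row:
        total += coeff * row[-1]       # row[-1] is the r-th backward difference at x_0
        coeff *= (u + r) / (r + 1)     # passes from (u + r - 1)_(r) to (u + r)_(r+1)
        row = [b - a for a, b in zip(row, row[1:])]
        r += 1
    return total

# Illustrative data: f(x) = x^2 at x = -2, -1, 0 (h = 1), evaluated at x = 0.5
value = newton_backward([4.0, 1.0, 0.0], 0.5)
```

With three points and a quadratic integrand the result should reproduce 0.5^2 apart from rounding.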

There are many different forms in which the interpolation polynomial 
of degree $n$ can be written, and these can be obtained, one from the others, 
by neglecting differences of order $n + 1$ and higher, and then by renaming 
the points. One simple scheme is based upon the "lozenge diagram." 
In the array 



-1 ; 



where parentheses are omitted from subscripts of u, the sum of the terms 
in the upper row is equal to the sum of the terms in the lower row: 



The identity follows directly from the relation 
Ur(&fn - A"/,) = [(t* + l) r+1 - 
which is easily verified. Consequently in the array 







(U + l),Uy 

^-u> J 



the sum of any two terms connected by a dash pointing down to the right 
is equal to the sum of the two terms connected by the dash just below. 
Now if we start with $f_0$ and proceed diagonally downward, summing the 
first $n + 1$ terms, we obtain the right member of (5.15.11). But the sum 
of any other sequence of terms obtained by proceeding to the right and 
ending with $\Delta^n f_0$ will have, according to the theorem, identically the 
same value. Hence we obtain different expressions for the same interpolation 
polynomial. By ending on $\Delta^n f_i$ for $i \neq 0$, we obtain an interpolation 
polynomial based upon a different set of fundamental points. 

It has been remarked already that with uniform spacing the interpolation 
is most accurate for points near the middle of the range. For computational 
purposes it is convenient if the coefficients are small. Both 
conditions are satisfied if one designates as $x_0$ the fundamental point 
closest to the point $x$ to be interpolated and if the series contains only 
terms as close as possible to the horizontal line through $f_0$. The two 
Newton-Gauss formulas result: 

(5.15.15) $P(x) = f_0 + u_1\Delta f_0 + u_2\Delta^2 f_{-1} + (u + 1)_3\Delta^3 f_{-1} + (u + 1)_4\Delta^4 f_{-2} + (u + 2)_5\Delta^5 f_{-2} + \cdots$ 

and 

(5.15.16) $P(x) = f_0 + u_1\Delta f_{-1} + (u + 1)_2\Delta^2 f_{-1} + (u + 1)_3\Delta^3 f_{-2} + (u + 2)_4\Delta^4 f_{-2} + (u + 2)_5\Delta^5 f_{-3} + \cdots,$ 

the first following the lower, the second the upper broken line through $f_0$. 
These last formulas are more neatly expressed by central differences. 
In this notation the same lozenge diagram appears as follows: 




The two formulas (5.15.15) and (5.15.16) appear in these notations: 
(5.15.17) $P(x) = f_0 + u_1\,\delta f_{1/2} + u_2\,\delta^2 f_0 + (u + 1)_3\,\delta^3 f_{1/2} + (u + 1)_4\,\delta^4 f_0 + \cdots$ 

and 
(5.15.18) $P(x) = f_0 + u_1\,\delta f_{-1/2} + (u + 1)_2\,\delta^2 f_0 + (u + 1)_3\,\delta^3 f_{-1/2} + (u + 2)_4\,\delta^4 f_0 + \cdots.$ 

These two formulas can be combined to give a single more symmetric 
formula, if we introduce the "central mean" operator $\mu$ defined by 

(5.15.19) $2\mu = E^{1/2} + E^{-1/2}.$ 

If we add the two formulas and divide by 2, the result is 

(5.15.20) $P(x) = [1 + u\mu\delta + u^2\delta^2/2! + u(u^2 - 1^2)\mu\delta^3/3! + u^2(u^2 - 1^2)\delta^4/4! + u(u^2 - 1^2)(u^2 - 2^2)\mu\delta^5/5! + \cdots]f_0.$ 

If between two differences of odd order in the table one writes their mean, 
the coefficients for the above formula all lie on the same horizontal line. 
A great many other special formulas can be derived, but for these 
reference must be made to the copious literature on interpolation. 
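
The symmetric formula obtained by averaging the two Newton-Gauss formulas (commonly called Stirling's formula) can be sketched as follows. This Python fragment is an added illustration, not taken from the text; the name `stirling` is hypothetical, and the table is assumed to hold the 2p + 1 ordinates f_{-p}, ..., f_0, ..., f_p. Central differences are read from an ordinary forward-difference table, using the fact that the even central difference of order 2k at x_0 is the forward difference of order 2k at x_{-k}, while the mean odd difference averages the two entries adjacent to the central line.

```python
def stirling(values, u):
    """values = [f_{-p}, ..., f_0, ..., f_p]; evaluate the central formula at x_0 + u h."""
    p = len(values) // 2
    diffs = [list(values)]                       # diffs[r][i]: r-th forward difference
    while len(diffs[-1]) > 1:
        prev = diffs[-1]
        diffs.append([b - a for a, b in zip(prev, prev[1:])])
    total = values[p]
    odd_coeff = u                                # u(u²-1²)...(u²-(k-1)²)/(2k-1)!
    for k in range(1, p + 1):
        mean_odd = (diffs[2 * k - 1][p - k] + diffs[2 * k - 1][p - k + 1]) / 2
        even = diffs[2 * k][p - k]               # central difference of order 2k at x_0
        total += odd_coeff * mean_odd + odd_coeff * u / (2 * k) * even
        odd_coeff *= (u * u - k * k) / ((2 * k) * (2 * k + 1))
    return total

# Illustrative data: f(x) = x^3 + 2x^2 at x = -2, ..., 2, evaluated at u = 0.5
value = stirling([0.0, 1.0, 0.0, 3.0, 16.0], 0.5)
```

Five points determine a quartic, so the cubic used here should be reproduced apart from rounding.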

5.151. The remainder in equal-interval interpolation. All interpolation 
formulas that can be obtained from the lozenge diagram by following any 
broken-line path to a particular difference $\delta^n f_0$ are identical except in the 
arrangement of the terms. Hence they all have the same remainder term. 
If $n = 2p$ is an even number, the remainder is 

(5.151.1) $R_{n+1} = h^{n+1}u(u^2 - 1^2) \cdots (u^2 - p^2)f^{(n+1)}(\xi)/(n + 1)!,$ 

where $\xi$ is a point on the smallest interval containing $x, x_{-p}, x_{-p+1}, \ldots, x_p$. 
If possible, for $|u| < 1$ a formula which ends in a difference 
$\delta^n f_0$, $\delta^n f_{-1}$, or $\delta^n f_1$ should be used, rather than one ending in some other 
$\delta^n f_i$, though this will not be possible when, for example, the tabulation 
begins with $u = 0$. The same expression (5.151.1) for the remainder 
holds for (5.15.20) if the series is terminated on a difference $\delta^{2p}f_0$ of even 
order. 

If $0 < u < 1$, a formula (5.15.17) which ends on a difference of odd 
order is most symmetric with respect to that point. If $n = 2p - 1$, then 
in this case 

$R_{n+1} = h^{n+1}u(u^2 - 1^2) \cdots [u^2 - (p - 1)^2](u - p)f^{(n+1)}(\xi)/(n + 1)!.$ 

To obtain an upper limit for the truncation error from $R_{n+1}$ directly, 
one must have an upper limit for $f^{(n+1)}$ over the interval containing the $x_i$, 
and this is not necessarily easy to obtain. However, it may be known 
that each of certain consecutive derivatives retains a fixed sign over the 
interval, and these signs may be known. In this event, Steffensen's 
"error test" can be applied. This rests upon the simple observation 
that in any series 

$S = u_0 + u_1 + \cdots + u_n + R_{n+1},$ 

where $R_{n+1}$ is the error due to dropping terms beginning with $u_{n+1}$, if it is 
known that $R_{n+1}$ and $R_{n+2}$ have opposite signs, then since 

$R_{n+1} = u_{n+1} + R_{n+2},$ 

it follows that 

$|R_{n+1}| < |u_{n+1}|,$ 

and the error is less than the first neglected term in absolute value. 

5.2. Trigonometric and Exponential Interpolation. Given any set of 
constants $\alpha_i$ and $\beta_i$, among the possible choices of basic functions 
for interpolation would be 

$\phi_i(x) = \sin(\alpha_i x + \beta_i)$ 
or 

$\phi_i(x) = \exp(\alpha_i x + \beta_i).$ 

The general methods described at the beginning of this chapter apply, but 
not much more can be said when the constants $\alpha_i$ and $\beta_i$ are completely 
arbitrary and unrelated. Most often, however, the $\alpha_i$ will be in arithmetic 
progression, in which case certain special simplifications are possible. 
If, possibly after changing scale, one can set 

$\alpha_k = k,$ 

then for exponential interpolation one has 
(5.2.1) $\Phi(x) = \sum \gamma_k e^{kx}.$ 

However, if 

$z = e^x,$ 
then 

$\Phi = \sum \gamma_k z^k,$ 

and the exponential interpolation is the same as polynomial interpolation 
with a transformation on the independent variable. 
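Another exponential form is 

(5.2.2) $\Phi(x) = \gamma_0 + \gamma_1 e^{x} + \gamma_{-1}e^{-x} + \cdots + \gamma_n e^{nx} + \gamma_{-n}e^{-nx}.$ 

Given now the $2n + 1$ points $x_0, x_1, \ldots, x_{2n}$, let 

(5.2.3) $E_i(x) = \prod_{j=0,\, j \neq i}^{2n} \{\exp[(x - x_j)/2] - \exp[-(x - x_j)/2]\},$ 

(5.2.4) $e_i(x) = E_i(x)/E_i(x_i).$ 
Then 

$e_i(x_j) = \delta_{ij},$ 
whence 

(5.2.5) $\Phi(x) = \sum_{i=0}^{2n} y_i e_i(x)$ 
satisfies 

$\Phi(x_i) = y_i$ 

and expands into a polynomial of the form (5.2.2). 
In the form 

(5.2.6) $\Phi(x) = \gamma_0 + \gamma_1 e^{ix} + \gamma_{-1}e^{-ix} + \cdots + \gamma_n e^{nix} + \gamma_{-n}e^{-nix},$ 

if $\gamma_k$ and $\gamma_{-k}$ are conjugate complex quantities, then (5.2.6) can be written 
in the form 
(5.2.7) $\Phi(x) = \alpha_0 + \alpha_1\cos x + \alpha_2\cos 2x + \cdots + \alpha_n\cos nx$ 
$+ \beta_1\sin x + \beta_2\sin 2x + \cdots + \beta_n\sin nx.$ 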



Analogous to the functions $E_i(x)$ and $e_i(x)$ are the functions 

(5.2.8) $S_i(x) = \prod_{j=0,\, j \neq i}^{2n} \sin[(x - x_j)/2],$ 

(5.2.9) $s_i(x) = S_i(x)/S_i(x_i).$ 
Then 

(5.2.10) $\Phi(x) = \sum_{i=0}^{2n} y_i s_i(x).$ 
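
A trigonometric interpolant built from products of half-angle sines can be evaluated directly. The Python sketch below is an added illustration resting on an assumption: it takes the basic functions to be the normalized products of sin[(x - x_j)/2] over the other points, each equal to 1 at its own point and 0 at the rest. The name `trig_interpolate` is hypothetical. An odd number 2n + 1 of points, distinct modulo 2π, is required.

```python
from math import sin, cos

def trig_interpolate(xs, ys, x):
    """Sum of y_i times the normalized product of sin((x - x_j)/2), j != i."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= sin((x - xj) / 2) / sin((xi - xj) / 2)
        total += yi * weight
    return total

# Illustrative data: a first-order trigonometric sum sampled at three points
def sample(x):
    return 1.0 + 2.0 * cos(x) - 3.0 * sin(x)

nodes = [0.0, 2.0, 4.0]
value = trig_interpolate(nodes, [sample(x) for x in nodes], 1.0)
```

Because the sampled function already has the form (5.2.7) with n = 1, the interpolant should reproduce it at any x apart from rounding.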



5.3. Bibliographic Notes. The literature on interpolation, methods 
of approximate representation, and numerical quadrature is vast, and 
only a few elementary general principles are given here. On operational 
methods see Whittaker and Robinson, Steffensen (1927), Aitken (1929), 
McClintock (1895), Herget (1948), Jordan (1947), and the encyclopedias, 
both French and German. 

The usual expression for the remainder in polynomial interpolation 
is (5.11.2); its generalization (5.01.23) is given by Petersson (1949). 
Steffensen (1927) gives (5.141.8). The form (5.01.25) for polynomials 
was given by Peano (1914; see also Mansion, 1914); Kowalewski (1932) 
gave this and (5.01.28), still for polynomials only. The more general 
form (5.01.25) for arbitrary interpolating functions was given by Sard 
(1948a) and Milne (1948b). See also Curry (1951a). These will 
reappear in the following two chapters. Attention is called to subsequent 
papers by Sard and collaborators. 

The discussion of optimal-interval interpolation is based on Harrison 
(1949). 

The designation "Chebyshev polynomials" is sometimes applied to 
any set of orthogonal polynomials. The term "Chebyshev polynomial" 
is sometimes applied to that polynomial of degree n which among all 
polynomials of that degree most closely approximates the function under 
consideration, where the measure of departure is the maximum absolute 
difference between the polynomial and the function. This is not necessarily 
the interpolating polynomial which agrees with the function at the 
Chebyshev points (5.12.4), since in (5.11.1) it is not shown that $\xi$ necessarily 
maximizes $f^{(n+1)}$ at the same time that $x$ maximizes $\omega = T_n$. No 
simple algorithm is available in general for constructing the closest polynomial 
approximation to a given function. 

One might suppose that by increasing the degree of the interpolating 
polynomial one would necessarily improve the approximation to any 
given continuous function, at least when the fundamental abscissas are 
always uniformly spaced. However, this is not so. For discussion of 
these and related problems see Bernstein (1926), Ahiezer (1948), de la 
Vallée Poussin (1952), Fejér (1934), Jackson (1930), and Feldheim (1939). 

Most treatments of interpolation discuss also the subject of "inverse 
interpolation," which is the evaluation of the independent variable at the 
point where the function takes on a prescribed value, not tabulated. An 
obvious possibility is to interchange the roles of the dependent and 
independent variables and interpolate in the ordinary manner. Other- 
wise one can equate the interpolating polynomial to the prescribed 
value and solve the resulting algebraic equation. In this connection see 
Kincaid (1948b). 

Most standard texts discuss divided differences. See Aitken (1932) 
for the method called by his name and Neville (1934) for a similar method. 

Tables of differences, ordinary or divided, are useful for detecting errors 
in computed tables of functions. See Miller (1950). 



CHAPTER 6 
MORE GENERAL METHODS OF APPROXIMATION 

6. More General Methods of Approximation 

In selecting a particular linear combination 
(6.0.1) $\Phi(x) = \sum \gamma_i\phi_i(x)$ 

of functions $\phi_i(x)$ to perform an interpolation, the aim is naturally to 
determine a combination $\Phi$ which approximates the given function $f$ at 
whatever point or points the function $f$ is to be evaluated. In the method 
of interpolation, so-called, certain values $y_i = f(x_i)$ of $f$, and possibly 
certain values $y_i^{(r)} = f^{(r)}(x_i)$, must be known already, or directly obtainable. 
Then the coefficients $\gamma_i$ are determined by the requirement that 
each $y_i = \Phi(x_i)$ and that each $y_i^{(r)} = \Phi^{(r)}(x_i)$. However, this is only one 
of many possible schemes for determining the $\gamma_i$. 

An expansion in powers of $x - x_0$ by Taylor's series up to and including 
the term in $(x - x_0)^n$ differs in appearance from interpolation, but can 
be made a special case in which the conditions for determining the 
coefficients are that 

(6.0.2) $\Phi^{(r)}(x_0) = f^{(r)}(x_0), \quad r = 0, \ldots, n.$ 

Petersson's generalized expansion, which satisfies conditions (6.0.2) but 
replaces polynomials by a more general set of functions $\phi_i(x)$, was given 
in 5.01. The well-known expansions in Fourier or other orthogonal 
series, with only a finite number of terms retained, illustrate other methods 
of obtaining approximate representations that are useful for some 
purposes. 

In this chapter will be treated only certain fairly immediate extensions 
of the methods of interpolation. 

6.1. Finite Linear Methods. In the expansion of a function as a 
series of orthogonal functions, the coefficients are determined by integra- 
tion. The methods to be described here require summation rather than 
integration, and hence are called finite. 

When the given quantities $y_i$ are not exactly, but only approximately, 
equal to the $f(x_i)$, it is not necessarily worth while to require that the $\Phi(x_i)$ 
be exactly equal to the $y_i$. If there are $n + 1$ quantities $y_i$, approximately 
equal to $f$ at $n + 1$ distinct points $x_i$, it may be saving in time, and quite 
adequate, to find some linear combination (6.0.1) of $m + 1 < n + 1$ 
functions $\phi$ in such a way that the resulting $\Phi(x_i)$ will be as nearly as 
possible equal to the $y_i$, without however expecting strict equality, at 
least not at all of the points. This is especially the case when the $y_i$ result 
from experimental measurements or from computations in which 
the round-off is not negligible. But if strict equality can no longer be 
asked for, some other criterion must be found that will yield a system of 
only $m + 1$ equations in the $m + 1$ unknowns but that will in some sense 
treat all $n + 1$ of the points alike. 

Let the index $r$ run from 0 to $m$, to distinguish it from $i, j, \ldots$, which 
run from 0 to $n > m$. We seek 

(6.1.1) $\Phi(x) = \sum_r \gamma_r\phi_r(x) = f(x) - R(x),$ 

where $R$ does not necessarily vanish even at the $x_i$. The criterion will 
be given in the following form: Choose $m + 1$ functions $\psi_r(x)$ so that 
the matrix of the $\psi_r(x_i)$ has rank $m + 1$, and then determine the $\gamma_r$ so 
that the $m + 1$ conditions 

(6.1.2) $\sum_i \psi_r(x_i)R(x_i) = 0$ 

are satisfied. If the matrix of the $\phi_r(x_i)$ also has rank $m + 1$, then 
Eqs. (6.1.2) define the $\gamma_r$ uniquely. Moreover, in the special case 
$m = n$, $\Phi$ becomes the ordinary interpolating function. 

Let $d$, $c$, $y$, $f_r$, and $p_r$, respectively, represent the vectors whose elements 
are $R(x_i)$, $\gamma_r$, $y_i$, $\phi_r(x_i)$, and $\psi_r(x_i)$; let $F$ and $P$ represent the matrices 
whose columns are the vectors $f_r$ and $p_r$, respectively. It is required to 
satisfy the equations 

(6.1.3) $Fc = y + d,$ 

(6.1.4) $P^T d = 0.$ 

Thus $y$ is to be resolved into two components, one in the space of $F$ and 
the other orthogonal to the space of $P$. On combining these two equations 
one obtains 

(6.1.5) $P^T Fc = P^T y.$ 

If each matrix $P$ and $F$ has rank $m + 1 < n + 1$, then the matrix $P^T F$ 
is nonsingular, and there is a unique solution $c$ of (6.1.5). Given a 
system of functions $\phi_r(x)$, the problem reduces to the multiplication of 
the matrices and solution of the linear system (6.1.5), and this is a 
problem already discussed at length in Chap. 2. 
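
The reduction to a square linear system can be sketched in a few lines. The Python fragment below is an added illustration, not the text's own procedure: the names `solve` and `finite_linear_fit` are hypothetical, the elimination routine is a plain Gaussian elimination chosen for brevity, and the sketch assumes that the matrix formed from the two sets of functional values is in fact nonsingular for the functions and points supplied.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (m[k][n] - sum(m[k][j] * z[j] for j in range(k + 1, n))) / m[k][k]
    return z

def finite_linear_fit(phis, psis, xs, ys):
    """Coefficients c satisfying P^T F c = P^T y for the given functions and points."""
    F = [[phi(x) for phi in phis] for x in xs]
    P = [[psi(x) for psi in psis] for x in xs]
    size, rows = len(phis), range(len(xs))
    ptf = [[sum(P[i][r] * F[i][q] for i in rows) for q in range(size)] for r in range(size)]
    pty = [sum(P[i][r] * ys[i] for i in rows) for r in range(size)]
    return solve(ptf, pty)

# Illustrative data: four points, fitted by a straight line with psi_r = phi_r
line = [lambda x: 1.0, lambda x: x]
coefficients = finite_linear_fit(line, line, [0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 0.0, 5.0])
```

Taking the second set of functions equal to the first, as in this example, gives the ordinary least-squares line; other choices give other resolutions of the data.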

However, it may happen that at the outset one does not know how 
many functions $\phi_r$ one should use to obtain an adequate representation in 
the sense that the vector $d$ has been made sufficiently small. This vector 
represents the amount by which the function $\Phi$ fails to agree even with 
the given values $y_i$ of $f$. It is geometrically evident that, if the system 
(6.1.5) of order $m + 1$ has been solved, and then also a new system of 
order $m + 2$ obtained by adjoining a new function $\phi_{r+1}$, the new residual 
vector $d$ cannot be greater than the old one, and in general will actually 
be less. Moreover, if $n + 1$ functions are included, $d$ will vanish since 
we are back to the case of interpolation. One would like, if possible, to 
start with a single function $\phi_0$ and adjoin sequentially $\phi_1, \phi_2, \ldots$, 
stopping only after the vector $d$ has been made sufficiently small. It is 
possible to do this without having to solve a new system each time by 
forming a pair of biorthogonal systems of vectors. 

The method can be developed by showing that unit triangular matrices 
$U$ and $V$ can be found such that for the two matrices 

(6.1.6) $T = FV^{-1}, \quad S = PU^{-1},$ 
the matrix product 

(6.1.7) $S^T T = D$ 

is a diagonal matrix. This is a generalization of the theorem shown at 
the end of 2.201. 

The theorem is trivial when $P$ and $F$ have but a single column each. 
An inductive proof can be given, which also exhibits the algorithm, by 
assuming relations (6.1.6) and (6.1.7) and showing that, given vectors $f$, 
independent of the columns of $F$, and $p$, independent of the columns of 
$P$, it is possible to find vectors $u$, $v$, $t$, $s$, and a scalar $\delta$, satisfying 

(6.1.8) $(F, f) = (T, t)\begin{pmatrix} V & v \\ 0 & 1 \end{pmatrix}, \quad (P, p) = (S, s)\begin{pmatrix} U & u \\ 0 & 1 \end{pmatrix}, \quad (S, s)^T(T, t) = \begin{pmatrix} D & 0 \\ 0 & \delta \end{pmatrix}.$ 

The last relation gives $\delta$ as 

(6.1.9) $s^T t = \delta$ 
and requires that 

(6.1.10) $s^T T = 0, \quad S^T t = 0.$ 
The first two require that 

(6.1.11) $Tv + t = f, \quad Su + s = p.$ 

Hence when the first of (6.1.11) is multiplied on the left by $S^T$ and the 
second by $T^T$, $s$ and $t$ are eliminated, and equations in $v$ alone and in $u$ 
alone result. Because of (6.1.7), and because $D$ is diagonal and hence 
symmetric, $v$ and $u$ are given by 

(6.1.12) $v = D^{-1}S^T f, \quad u = D^{-1}T^T p,$ 

provided only that no diagonal element in $D$ is zero. After $v$ and $u$ are 
found from (6.1.12), $t$ and $s$ are obtained from (6.1.11), and $\delta$ from (6.1.9). 
This establishes the induction and provides an algorithm for finding the 
matrices $S$, $T$, $U$, $V$, and $D$. 

Now observe that the columns of $T$ are linear combinations of those of 
$F$, and the columns of $S$ are linear combinations of those of $P$. Equivalent 
to (6.1.3) and (6.1.4) are the equations 

$TVc = y + d, \quad S^T d = 0.$ 
Instead of solving for $c$, one solves for 

(6.1.13) $b = Vc$ 
by the simple relation 

(6.1.14) $b = D^{-1}S^T y.$ 

To return to the induction, suppose that $b$ is given and that the effect of 
adjoining a new function $\phi$, and hence a new vector $f$, is to be investigated. 
Vectors $s$ and $t$ are found by the method already described. 
The vector $b$ is then unaffected except by the adjunction of an additional 
element 

(6.1.15) $\beta = \delta^{-1}s^T y.$ 
The equation 

(6.1.16) $Tb = y + d$ 
becomes 

$(T, t)\begin{pmatrix} b \\ \beta \end{pmatrix} = y + d',$ 

where $d'$ is the new residual, or on expanding and applying (6.1.16), 

$d + t\beta = d'.$ 

The vector $d$ is orthogonal to all columns of $T$; $d'$ is orthogonal to all 
columns of the enlarged matrix $(T, t)$; in particular $d'$ and $\beta t$ are orthogonal. 
Hence 

(6.1.17) $d^T d = \beta^2 t^T t + d'^T d'.$ 

Thus the adjunction of the new function $\phi$ reduces the squared length 
of the residual $d$ by the amount $\beta^2 t^T t$. 
Once it is decided how many functions $\phi$ are to be included in $\Phi$, one 
can go to (6.1.13) and solve for $c$. Alternatively, one can form the functions 
$\tau_r(x)$ corresponding to the columns $t_r$ of $T$. Thus let $\tau(x)$ represent 
the row vector whose elements are the first $m + 1$ functions $\tau_0(x), \ldots, \tau_m(x)$; 
let $\phi(x)$ represent the row vector whose elements are $\phi_0(x), \ldots, \phi_m(x)$. 
Then 

(6.1.18) $\tau(x) = \phi(x)V^{-1}.$ 
Then $\Phi(x)$ is given in either of the two ways: 

(6.1.19) $\Phi(x) = \tau(x)b = \phi(x)c.$ 

The matrix $V$ and the functions $\tau(x)$ are entirely independent of the 
function to be approximated, and depend only upon the functions $\phi_r$ and 
$\psi_r$ and upon the points $x_i$. Hence if the same basic functions and basic 
points are to be used in the representation of more than one function 
$f(x)$, it is advantageous to find the $\tau(x)$ and the matrices $T$, $S$, $V$, $U$ once 
and for all. 
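
The sequential construction of the biorthogonal columns can be sketched as follows. This Python fragment is an added illustration, not the author's program; the names `dot` and `adjoin` are hypothetical. Each call takes a new pair of columns, removes their components along the columns already formed, and records the new diagonal element. The numerical example takes the second set of columns equal to the first (the least-squares case), for which the decrease in the squared length of the residual can be checked directly; for unequal sets the sketch still produces biorthogonal columns provided no diagonal element vanishes.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adjoin(T, S, D, f, p):
    """Append the new columns t, s and the new diagonal element to T, S, D in place."""
    t, s = list(f), list(p)
    for tr, sr, dr in zip(T, S, D):
        v = dot(sr, f) / dr                     # one element of v = D^{-1} S^T f
        u = dot(tr, p) / dr                     # one element of u = D^{-1} T^T p
        t = [ti - v * tri for ti, tri in zip(t, tr)]
        s = [si - u * sri for si, sri in zip(s, sr)]
    T.append(t)
    S.append(s)
    D.append(dot(s, t))

# Illustrative data: fit y by a constant, then adjoin the column of abscissas
xs = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 0.0, 5.0]
T, S, D, b, lengths = [], [], [], [], []
for f in ([1.0] * 4, xs):
    adjoin(T, S, D, f, f)                       # here p = f
    b.append(dot(S[-1], y) / D[-1])             # new element of b
    fitted = [sum(br * tr[i] for br, tr in zip(b, T)) for i in range(len(y))]
    d = [fi - yi for fi, yi in zip(fitted, y)]  # residual, with the sign of Tb = y + d
    lengths.append(dot(d, d))
```

Earlier elements of `b` are never recomputed, which is the practical advantage of the scheme: a further function can be tried at the cost of one more pair of columns.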

6.101. An expression for the remainder. The solution $c$ of Eqs. (6.1.5) 
can be written 

$c = (P^T F)^{-1}P^T y,$ 

and hence the function $\Phi$ as 
(6.101.1) $\Phi(x) = \phi(x)(P^T F)^{-1}P^T y.$ 

If for the elements $y_j$ of the vector $y$ one substitutes the values $\phi_r(x_j)$ 
of any $\phi_r$, the result is 

(6.101.2) $\phi_r(x) = \phi(x)(P^T F)^{-1}P^T f_r,$ 

since the quantities $\phi_r(x_j)$ are the elements of the column $f_r$ of the matrix 
$F$. Thus (6.101.1) expresses $\Phi(x)$ as a linear combination of the values 
$y_j$ of $f(x)$ the coefficients of which are certain functions $\Phi_j(x)$, 

(6.101.3) $\Phi(x) = \sum_j \Phi_j(x)y_j,$ 

and (6.101.2) implies that 
(6.101.4) $\phi_r(x) = \sum_j \Phi_j(x)\phi_r(x_j)$ 

identically. 

The Petersson expansion for $x_0 = a$ can be written 

(6.101.5) $f(x) = \sum_r a_r\phi_r(x) + \int_a^x g(x, s)L_{m+1}[f(s)]\,ds$ 

when the special functions denoted in 5.01 as $\Phi_r$ are replaced by linear 
combinations of the $\phi_r$. Moreover, 

$f(x_j) = \sum_r a_r\phi_r(x_j) + \int_a^{x_j} g(x_j, s)L_{m+1}[f(s)]\,ds.$ 

Now multiply this equation by $\Phi_j(x)$, sum over $j$, and subtract from 
(6.101.5). Because of (6.101.4), the result is 

(6.101.6) $R(x) = \int_a^x g(x, s)L_{m+1}[f(s)]\,ds - \sum_j \Phi_j(x)\int_a^{x_j} g(x_j, s)L_{m+1}[f(s)]\,ds.$ 

If, as in 5.01, we write $\int_a^{x_j} = \int_a^x - \int_{x_j}^x$, we obtain on applying 
(6.101.4) 

(6.101.7) $R(x) = \sum_j \Phi_j(x)\int_{x_j}^x g(x_j, s)L_{m+1}[f(s)]\,ds,$ 

similar to (5.01.26). For the symmetric form given in (5.01.28), we 
can again write (6.101.6) with the limit $b$ instead of $a$ and obtain 

(6.101.8) $R(x) = \int_a^b K(x, s)L_{m+1}[f(s)]\,ds,$ 
$2K(x, s) = g(x, s)\,\mathrm{sgn}(x - s) - \sum_j \Phi_j(x)g(x_j, s)\,\mathrm{sgn}(x_j - s).$ 

6.11. Least-square Curve Fitting. The residual vector $d$ always vanishes 
when $m = n$, provided at each step the columns of $F$, as well as those of 
$P$, are kept linearly independent. Moreover, the vector $c$ is then independent 
of the choice of the functions $\psi$, given only the linear independence 
of the sets of functional values. But for any given $m < n$, the 
vectors $c$ and $d$ and the length of $d$ will depend upon the selection of the 
functions $\psi$. Thus, for example, one could choose functions $\psi$ each of 
which vanishes at all but $m + 1$ of the points $x_j$. This would give the 
interpolating function $\Phi(x)$ which passes through these $m + 1$ points 
and would take no account of the other points. 

It is natural to ask that, for whatever $m < n$ one might fix upon, the 
vector $c$ be chosen so that the vector $d$ is as small as possible. In all cases 
the vector $y$ is resolved into two components, one of which lies in the 
space of $F$, $d$ being the other component. Then the shortest length 
possible for $d$ is the length of the perpendicular from the point $y$ to the 
space of $F$. Hence the component $Fc$ of $y$ in the space of $F$ is to be chosen 
as the orthogonal projection of $y$ upon this space. This is effected by 
making $\psi_r = \phi_r$, and hence by taking the matrix $P$ to be the same as $F$. 
All equations in 6.1 can be specialized immediately to this case. It 
may be remarked in passing that in many cases there are good statistical 
grounds for minimizing the length (or the squared length) of $d$, but these 
will not be developed here. 
Often experimental conditions may be such that certain values of 
the $y_j$ may be less reliable than others. If so, greatest "weight" should 
be given to the measurements of highest reliability. This can be effected 
by associating a diagonal matrix $W$ of order $n + 1$, with positive 
diagonal elements, in which the magnitude of each diagonal element, 
say the $j$th, measures the degree of reliability attached to the value of the 
measurement $y_j$. The matrix $W$ is then used as a metric for the space, 
and the orthogonality and lengths of the vectors are taken with reference 
to this metric. This can be achieved by setting 

$P = WF$ 
in 6.1. 

6.111. Least-square fitting of polynomials. A special simplification is 
possible when the function $\Phi$ is to be a polynomial, and 

(6.111.1) $\phi_r(x) = x^r.$ 

Let $X$ be the diagonal matrix whose diagonal elements are the abscissas 
$x_j$. Then 

(6.111.2) $f_r = X^r f_0.$ 

The argument used in proving Lanczos's theorem (2.22) can be applied 
here to show that each column $t_{r+1}$ of the matrix $T$ is expressible as a 
linear combination of $Xt_r$, $t_r$, and $t_{r-1}$. In fact, 

(6.111.3) $t_0 = f_0, \quad t_1 = (X - \alpha_0 I)t_0, \quad t_{r+1} = (X - \alpha_r I)t_r - \delta_r\delta_{r-1}^{-1}t_{r-1}, \quad r \ge 1,$ 

where 

(6.111.4) $\alpha_r\delta_r = t_r^T WXt_r.$ 

From $t_0 = f_0$ one calculates $\alpha_0$ and $\delta_0$, thence $t_1$, $\alpha_1$, and $\delta_1$, and so on, 
sequentially. The functions $\tau_r(x)$ are orthogonal on the set of points 
$x_i$ and are given by 

(6.111.5) $\tau_0(x) = 1, \quad \tau_1(x) = x - \alpha_0, \quad \tau_{r+1}(x) = (x - \alpha_r)\tau_r(x) - \delta_r\delta_{r-1}^{-1}\tau_{r-1}(x).$ 
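
A three-term recurrence of this kind is easily programmed. The Python sketch below is an added illustration under stated assumptions, not the text's own algorithm: the name `orthogonal_columns` is hypothetical, the first column is taken to be a column of ones, and each diagonal element is taken as the weighted squared length of the corresponding column, with the weights supplied as a list.

```python
def orthogonal_columns(xs, weights, degree):
    """Columns t_0, ..., t_degree, mutually orthogonal in the metric given by weights."""
    t_prev, t_cur, delta_prev = None, [1.0] * len(xs), None
    cols, deltas = [], []
    for _ in range(degree + 1):
        delta = sum(w * t * t for w, t in zip(weights, t_cur))
        alpha = sum(w * x * t * t for w, x, t in zip(weights, xs, t_cur)) / delta
        cols.append(t_cur)
        deltas.append(delta)
        nxt = [(x - alpha) * t for x, t in zip(xs, t_cur)]
        if t_prev is not None:
            ratio = delta / delta_prev          # coefficient of the column before last
            nxt = [n - ratio * tp for n, tp in zip(nxt, t_prev)]
        t_prev, t_cur, delta_prev = t_cur, nxt, delta
    return cols, deltas

# Illustrative data: five equally spaced abscissas with unit weights
cols, deltas = orthogonal_columns([0.0, 1.0, 2.0, 3.0, 4.0], [1.0] * 5, 3)
```

For these abscissas the columns of degree 1 and 2 should come out as x - 2 and (x - 2)^2 - 2 evaluated at the points.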

6.12. Finite Fourier Expansions. Let the abscissas $x_0, x_1, \ldots, x_{n-1}$ 
be uniformly spaced. After making a linear change of variable, we can 
assume that 

(6.12.1) $x_k = k/n,$ 
and the function $f(x)$ is to be considered only on the range $0 \le x < 1$, 
or is supposed periodic of period 1. Make a further change of variable by 

(6.12.2) $\omega = \exp(2\pi ix).$ 
Then 

(6.12.3) $\omega_k = \exp(2\pi ik/n),$ 
and 

(6.12.4) $\omega_k^j = \exp(2\pi ijk/n) = \omega_j^k.$ 

The function $f(x)$ becomes a function of $\omega$, $F(\omega)$, and the interpolating 
polynomial in $\omega$ gives a representation of $f(x)$ of the form 

$f(x) = \beta_0 + \beta_1\exp(2\pi ix) + \beta_2\exp(4\pi ix) + \cdots + \beta_{n-1}\exp[2(n - 1)\pi ix] + R(x).$ 

If $f(x)$ is real, then either each term is real or else its conjugate complex 
also occurs, if it is understood that 

(6.12.5) $\exp(2k\pi ix) = \cos(2k\pi x) + i\sin(2k\pi x).$ 



Suppose $n = pm$ is some integral multiple of $m$, and consider a representation 

(6.12.6) $f(x) = \beta_0 + \beta_1\exp(2\pi ipx) + \cdots + \beta_{m-1}\exp[2(m - 1)\pi ipx] + R(x)$ 
or 

(6.12.7) $F(\omega) = \beta_0 + \beta_1\omega^p + \cdots + \beta_{m-1}\omega^{p(m-1)} + P(\omega).$ 

If (6.12.3) is taken as valid for all integral values of $k$, positive and 
negative, which amounts to so defining the $\omega_k$ for $k$ negative, or exceeding 
$n - 1$, then (6.12.4) is valid for all integral values of $j$ and $k$. Hence 

$\sum_{j=0}^{n-1} \omega_j^k = \sum_{j=0}^{n-1} \omega_k^j.$ 

Any $\omega_k$ satisfies $\omega^n = 1$, but if $k \neq 0$, then $\omega_k \neq 1$. But since 

$\omega^n - 1 = (\omega - 1)(1 + \omega + \cdots + \omega^{n-1}),$ 

it follows that for $k \neq 0$ 

$\sum_{j=0}^{n-1} \omega_k^j = 0.$ 
Hence 

(6.12.8) $\sum_{k=0}^{n-1} \omega_k^{pj}\omega_k^{-ph} = n\delta_{jh}.$ 

Hence the functions $\omega^{pj}$ and $\omega^{-ph}$ are biorthogonal on the set $\omega_k$ in the 
sense of 6.1. Then if the functions $\omega^{pj}$ are taken as the $\tau_j$, and $\omega^{-ph}$ as 
the $\sigma_h$ of 6.1, the coefficients $\beta_j$ are given by 

(6.12.9) $\beta_j = n^{-1}\sum_k y_k\omega_k^{-pj}.$ 

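If $2j \neq m$, then 

$\beta_j = n^{-1}\sum_k y_k\omega_k^{-pj}, \quad \beta_{m-j} = n^{-1}\sum_k y_k\omega_k^{pj},$ 

or 

$\beta_j = n^{-1}\sum_k y_k[\cos(2\pi pjk/n) - i\sin(2\pi pjk/n)],$ 

$\beta_{m-j} = n^{-1}\sum_k y_k[\cos(2\pi pjk/n) + i\sin(2\pi pjk/n)].$ 

Hence, if 

(6.12.10) $A_j = n^{-1}\sum_k y_k\cos(2\pi pjk/n), \quad B_j = n^{-1}\sum_k y_k\sin(2\pi pjk/n),$ 

then 

$\beta_j = A_j - iB_j, \quad \beta_{m-j} = A_j + iB_j,$ 

and 

$\beta_k\omega^{pk} + \beta_{m-k}\omega^{-pk} = 2A_k\cos(2\pi pkx) + 2B_k\sin(2\pi pkx).$ 

Hence if $m$ is odd, the representation (6.12.6) can be written 

(6.12.11) $f(x) = A_0 + 2\sum_{k=1}^{(m-1)/2} A_k\cos(2\pi pkx) + 2\sum_{k=1}^{(m-1)/2} B_k\sin(2\pi pkx) + R(x)$ 

with the $A$'s and $B$'s given by (6.12.10). In case $m$ is even, $B_{m/2} = 0$, 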
while ^4 m/ 2 = if n is even, and A m /z 1/n if n is odd. 
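
The finite cosine and sine sums can be computed directly from their definitions. The Python sketch below is an added illustration, not taken from the text; the names `fourier_coefficients` and `evaluate` are hypothetical. It assumes n equally spaced ordinates y_k = f(k/n), with n an integral multiple of an odd m, forms the sums n^{-1} Σ y_k cos(2πpjk/n) and n^{-1} Σ y_k sin(2πpjk/n), and evaluates the resulting trigonometric sum.

```python
from math import cos, sin, pi

def fourier_coefficients(ys, m):
    """Cosine sums A_j and sine sums B_j for j = 0, ..., (m - 1)/2, with n = p m."""
    n = len(ys)
    p = n // m
    A, B = [], []
    for j in range((m - 1) // 2 + 1):
        A.append(sum(y * cos(2 * pi * p * j * k / n) for k, y in enumerate(ys)) / n)
        B.append(sum(y * sin(2 * pi * p * j * k / n) for k, y in enumerate(ys)) / n)
    return A, B, p

def evaluate(A, B, p, x):
    """A_0 plus twice the sum of A_k cos(2 pi p k x) + B_k sin(2 pi p k x)."""
    return A[0] + 2 * sum(A[k] * cos(2 * pi * p * k * x) + B[k] * sin(2 * pi * p * k * x)
                          for k in range(1, len(A)))

# Illustrative data: n = 5 ordinates of a sum of two harmonics, with m = 5 and p = 1
def sample(x):
    return 1.0 + cos(2 * pi * x) + 0.5 * sin(4 * pi * x)

A, B, p = fourier_coefficients([sample(k / 5) for k in range(5)], 5)
value = evaluate(A, B, p, 0.3)
```

Since the sampled function contains only the harmonics retained, the sum should reproduce it at any x apart from rounding.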

6.2. Chebyshev Expansions. It was shown in the last chapter that if 
$T_n$ is the Chebyshev polynomial of degree $n$, then $T_n$ is that polynomial 
of degree $n$ and leading coefficient unity whose maximum departure from 
0 on the interval from $-1$ to $+1$ is least. This property was used there to 
obtain a particular set of points of interpolation that was optimal in the 
sense there described. The property may, however, be utilized in a 
different way to obtain an approximate representation of $f(x)$. If $f(x)$ 
is expanded in a series of polynomials $T_n$, and if the coefficients of these 
polynomials in the expansion do not themselves increase too rapidly, 
then the fact that each polynomial is small over the entire interval 
suggests that a small number of terms in the expansion might provide 
an approximation that is uniformly good over the entire interval. This 
is in contrast to a Taylor series expansion which requires more terms the 
farther one goes from the center of the expansion. 

The analysis is simplified somewhat by introducing the functions 

(6.2.1) $C_n(x) = 2\cos n\theta, \quad x = 2\cos\theta.$ 

Then 

$C_0(x) = 1, \quad C_1(x) = x, \quad C_2(x) = x^2 - 2, \ldots$ 

and in general $C_n(x)$ is a polynomial in $x$ of leading coefficient 1. We 
consider the representation of the function $f(x)$ on the range from $-2$ to 
$+2$ and seek an expansion 

(6.2.2) $f(x) = a_0 + a_1C_1(x) + a_2C_2(x) + \cdots$ 

or equivalently 

(6.2.3) $f(2\cos\theta) = a_0 + 2\sum_{n=1}^{\infty} a_n\cos n\theta,$ 

with $\theta$ ranging from 0 to $\pi$. Since 

$2\cos n\theta\cos m\theta = \cos(n + m)\theta + \cos(n - m)\theta,$ 
it follows that 

(6.2.4) $2\int_0^\pi \cos n\theta\cos m\theta\,d\theta = 0$ for $n \neq m$, and $= \pi$ for $n = m$, 

and hence formally 

(6.2.5) $a_n = \pi^{-1}\int_0^\pi f(2\cos\theta)\cos n\theta\,d\theta = (1/2\pi)\int_{-2}^{2} f(x)C_n(x)(4 - x^2)^{-1/2}\,dx.$ 

Equations (6.2.4) express the orthogonality of the functions $\cos n\theta$ 
and $\cos m\theta$, $n \neq m$, on the interval from 0 to $\pi$. Equivalently, $C_n(x)$ 
and $C_m(x)$, $n \neq m$, are orthogonal with respect to the weight function 
$(4 - x^2)^{-1/2}$ on the interval from $-2$ to $+2$. 

Any power $x^n$ of $x$ is expressible as a linear combination of $C_n$ and 
polynomials of lower order: 

$x^n = C_n + nC_{n-2} + \binom{n}{2}C_{n-4} + \cdots.$ 

If we expand 

$f(x) = f(0) + xf'(0) + \cdots$ 
and substitute into the integrals (6.2.5), we obtain 
(6.2.6) $a_n = \frac{f^{(n)}(0)}{n!} + \frac{f^{(n+2)}(0)}{1!\,(n + 1)!} + \frac{f^{(n+4)}(0)}{2!\,(n + 2)!} + \cdots.$ 

The first term alone on the right is the coefficient of $x^n$ in the Taylor 
expansion. If this coefficient and $a_n$ are comparable, the expansion in 
Chebyshev polynomials, carried to a given number of terms, may be 
expected to give a representation of the function that is uniformly good 
for all $|x| < 2$; for $2 > |x| > 1$ it is better than the same number of terms 
of the Taylor series, though the representation will be less good than the 
Taylor series for $|x| < 1$. 
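
The coefficients of such an expansion can be approximated numerically from the integral in θ. The Python sketch below is an added illustration and not part of the text; the names `chebyshev_coefficients` and `evaluate` are hypothetical, and the midpoint rule in θ is an assumed choice of quadrature, convenient because it integrates low-order cosines exactly. The evaluation uses the recurrence C_{n+1}(x) = x C_n(x) - C_{n-1}(x), which follows from the addition formula for cosines, started from the value 2 cos 0 = 2 while the constant term itself is taken with C_0 = 1.

```python
from math import cos, exp, pi

def chebyshev_coefficients(f, count, nodes=64):
    """Approximate a_0, ..., a_{count-1} by the midpoint rule in theta on (0, pi)."""
    thetas = [(k + 0.5) * pi / nodes for k in range(nodes)]
    return [sum(f(2 * cos(t)) * cos(n * t) for t in thetas) / nodes for n in range(count)]

def evaluate(coeffs, x):
    """a_0 + a_1 C_1(x) + a_2 C_2(x) + ... for -2 <= x <= 2."""
    total = coeffs[0]
    prev, cur = 2.0, x                 # 2 cos 0 and C_1(x)
    for a in coeffs[1:]:
        total += a * cur
        prev, cur = cur, x * cur - prev
    return total

# Illustrative data: sixteen coefficients of the exponential function on (-2, 2)
coeffs = chebyshev_coefficients(exp, 16)
value = evaluate(coeffs, 1.5)
```

The coefficients of the exponential decrease rapidly, so `value` should agree closely with exp(1.5); for a function with slowly decreasing coefficients more terms and more nodes would be needed.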

6.3. Bibliographic Notes. The remainder formulas given here are 
additional special cases of the general remainder formulas of Sard and 
Milne. 

There is an extensive literature on methods of curve fitting, and only 
a few titles are given in the bibliography. Deming (1938) extends the 
method of least squares to cases where the parameters occur nonlinearly 
by employing a method of successive linearization. Remainder formulas 
in such a situation are not available. A "method of moments," common 
in the literature, leads to functions $\psi_r$ which are different from the $\phi_r$. For 
the infinite case, reference is to be made to the literature on Fourier 
series, orthogonal series in general, and the problem of moments. 

On the representation by Chebyshev polynomials see Lanczos (1938), 
Miller (1945), and Olds (1950). Nonlinear methods include the use of 
continued fractions (and reciprocal differences, described in the books on 
interpolation). 

On methods of smoothing data subject to error see Schoenberg (1946, 
1952), papers by Sard, and Lanczos (1952). 



CHAPTER 7 
NUMERICAL INTEGRATION AND DIFFERENTIATION 

7. Numerical Integration and Differentiation 

A function $\Phi(x)$ obtained by any of the methods outlined in the last 
two chapters, to the extent that it represents the function $f(x)$ it is 
intended to approximate, may replace $f$ in any numerical operations 
required. Thus where a derivative or an integral of $f(x)$ is required, one 
may differentiate or integrate $\Phi(x)$. Unfortunately for finding the 
derivative, though $\Phi(x)$ might agree closely with $f(x)$ in value, even over 
the entire interval, it can nevertheless happen that the slopes might be 
radically different. With integration, however, the situation is more 
fortunate, for clearly if on the interval from $a$ to $b$ the deviation $R(x)$ 
of $\Phi$ from $f$ nowhere exceeds $\epsilon$ in absolute value, then the error in the 
integral cannot exceed $\epsilon(b - a)$; and since ordinarily $R(x)$ will change 
sign at least once, one can expect the error to be much less than this. 

7.1. The Quadrature Problem in General. With the $m + 1$ functions 
$\phi_r(x)$, consider in addition a fixed function $w(x)$ and the integrals 

(7.1.1) $\int_a^b \phi_r(x)w(x)\,dx = \mu_r.$ 

Suppose that values 

(7.1.2) $y_i = f(x_i)$ 

are known at $n + 1$ points $x_i$. We seek coefficients $\lambda_i$, independent 
of the values of $y_i$, so that in the expression 

(7.1.3) $\int_a^b f(x)w(x)\,dx = \sum_i \lambda_i y_i + R$ 

the remainder $R$ vanishes whenever $f$ is equal to any of the $\phi_r$. Hence 

(7.1.4) $\sum_i \lambda_i\phi_r(x_i) = \mu_r.$ 

Here are $m + 1$ equations for determining the $n + 1$ required coefficients 
$\lambda_i$. If the equations are consistent, they can be satisfied whenever 
$n \ge m$, and if $n > m$, it is possible to impose $n - m$ further conditions on 
the $\lambda_i$. If $n < m$, the equations can be satisfied if and only if the points $x_i$ 
are so placed that the matrices $(\mu_r, \phi_r(x_i))$ and $(\phi_r(x_i))$ have the same 
rank. 

Consider again Petersson's expansion 

(7.1.5) $f(x) = \sum_r a_r\phi_r(x) + \int_a^x g(x, s)L_{m+1}[f(s)]\,ds,$ 
$y_i = \sum_r a_r\phi_r(x_i) + \int_a^{x_i} g(x_i, s)L_{m+1}[f(s)]\,ds.$ 

Assuming the multipliers $\lambda_i$ to have been found, multiply the first equation 
here by $w(x)$ and integrate from $a$ to $b$; multiply the second by $\lambda_i$ and sum, 
subtracting the result from the integral just obtained. By (7.1.3), 
(7.1.1), and (7.1.4), this gives 

$R = \int_a^b w(x)\int_a^x g(x, s)L_{m+1}[f(s)]\,ds\,dx - \sum_i \lambda_i\int_a^{x_i} g(x_i, s)L_{m+1}[f(s)]\,ds.$ 

A similar expansion written for $x_0 = b$ gives in the same way 

$R = -\int_a^b w(x)\int_x^b g(x, s)L_{m+1}[f(s)]\,ds\,dx + \sum_i \lambda_i\int_{x_i}^b g(x_i, s)L_{m+1}[f(s)]\,ds.$ 
Hence by addition of the two expressions 

(7.1.6) $R = \int_a^b M(s)L_{m+1}[f(s)]\,ds,$ 
$2M(s) = \int_a^b w(x)g(x, s)\,\mathrm{sgn}(x - s)\,dx - \sum_i \lambda_i g(x_i, s)\,\mathrm{sgn}(x_i - s).$ 

The function $M(s)$ is continuous since the discontinuity in the signum 
function occurs only where $g$ vanishes. 

As examples, let m = n = 1, $a = x_0 = 0$, and $b = x_1 = 1$. Let
w(x) = 1, $\phi_0(x) = 1$, and $\phi_1(x) = x$. Then g(x, s) = x - s, and
$\lambda_0 = \lambda_1 = 1/2$. Then M(s) = s(s - 1)/2. This gives the common
trapezoidal rule with remainder:

(7.1.7)  $\int_0^1 f(x)\,dx = (y_0 + y_1)/2 - \int_0^1 s(1 - s)\,f''(s)\,ds/2.$



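Next, let $w(x) = e^{\alpha x}$, $\phi_0(x) = 1$, and $\phi_1(x) = x$. Again g(x, s) = x - s.
However,

$\lambda_0 = (e^\alpha - 1 - \alpha)/\alpha^2, \quad \lambda_1 = (\alpha e^\alpha - e^\alpha + 1)/\alpha^2.$

Then $M(s) = [e^{\alpha s} - (e^\alpha - 1)s - 1]/\alpha^2$, and

(7.1.8)  $\int_0^1 f(x)\,e^{\alpha x}\,dx = [y_0(e^\alpha - 1 - \alpha) + y_1(\alpha e^\alpha - e^\alpha + 1)]/\alpha^2$
         $\qquad + \int_0^1 [e^{\alpha s} - (e^\alpha - 1)s - 1]\,f''(s)\,ds/\alpha^2.$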
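
As a numerical cross-check of the weights in this example, the conditions (7.1.4) for n = m = 1 can be solved directly as a small linear system. What follows is only a sketch under stated assumptions: the helper names `solve_linear` and `quadrature_weights` and the sample value alpha = 0.7 are illustrative and do not come from the text.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def quadrature_weights(basis, moments, points):
    """Weights lambda_i with sum_i lambda_i * phi_r(x_i) = mu_r (case n = m)."""
    a = [[phi(x) for x in points] for phi in basis]
    return solve_linear(a, moments)

alpha = 0.7  # assumed sample value
ea = math.exp(alpha)
mu0 = (ea - 1.0) / alpha                      # integral of e^{alpha x} on [0, 1]
mu1 = (alpha * ea - ea + 1.0) / alpha ** 2    # integral of x e^{alpha x} on [0, 1]
lam = quadrature_weights([lambda x: 1.0, lambda x: x], [mu0, mu1], [0.0, 1.0])
```

With these inputs `lam[0]` and `lam[1]` should agree, to rounding, with the closed forms for the two weights quoted above.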



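If w(x) = 1, $\phi_0(x) = 1$, and $\phi_1(x) = e^{\alpha x}$, then $g(x, s) = (e^{\alpha(x-s)} - 1)/\alpha$,
$\lambda_0 = (e^\alpha - 1)^{-1}\alpha^{-1}(\alpha e^\alpha - e^\alpha + 1)$, and $\lambda_1 = (e^\alpha - 1)^{-1}\alpha^{-1}(e^\alpha - 1 - \alpha)$.
Then $M(s) = (e^\alpha - 1)^{-1}\alpha^{-1}[s(e^\alpha - 1) - e^\alpha(1 - e^{-\alpha s})]$. Hence

(7.1.9)  $\int_0^1 f(x)\,dx = (e^\alpha - 1)^{-1}\alpha^{-1}[y_0(\alpha e^\alpha - e^\alpha + 1) + y_1(e^\alpha - 1 - \alpha)]$
         $\qquad + (e^\alpha - 1)^{-1}\alpha^{-1}\int_0^1 [s(e^\alpha - 1) - e^\alpha(1 - e^{-\alpha s})]\,[f''(s) - \alpha f'(s)]\,ds.$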



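Finally, let w(x) = 1, $\phi_0(x) = e^{\alpha x}$, and $\phi_1(x) = x e^{\alpha x}$. Then

$\lambda_0 = (e^\alpha - \alpha - 1)/\alpha^2, \quad \lambda_1 = (\alpha - 1 + e^{-\alpha})/\alpha^2,$

and $g(x, s) = (x - s)\,e^{\alpha(x-s)}$. Then

$M(s) = \{1 + [(1 - e^\alpha)s - 1]\,e^{-\alpha s}\}/\alpha^2.$

Hence

(7.1.10)  $\int_0^1 f(x)\,dx = [y_0(e^\alpha - \alpha - 1) + y_1(\alpha - 1 + e^{-\alpha})]/\alpha^2$
          $\qquad + \int_0^1 \{1 + [(1 - e^\alpha)s - 1]\,e^{-\alpha s}\}\,[f''(s) - 2\alpha f'(s) + \alpha^2 f(s)]\,ds/\alpha^2.$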



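When these formulas are applied repeatedly to intervals from i - 1 to i,
for i = 1, 2, . . . , n, then formulas (7.1.7), (7.1.9), and (7.1.10) give,
neglecting remainders,

$\int_0^n f(x)\,dx = y_0/2 + \sum_{i=1}^{n-1} y_i + y_n/2,$

$\int_0^n f(x)\,dx = (e^\alpha - 1)^{-1}\alpha^{-1}[y_0(\alpha e^\alpha - e^\alpha + 1) + y_n(e^\alpha - 1 - \alpha)] + \sum_{i=1}^{n-1} y_i,$

$\int_0^n f(x)\,dx = [y_0(e^\alpha - \alpha - 1) + (e^\alpha - 2 + e^{-\alpha})\sum_{i=1}^{n-1} y_i + y_n(\alpha - 1 + e^{-\alpha})]/\alpha^2.$

Equations (7.1.8) and (7.1.10) are identical if in the former $f(x)e^{\alpha x}$ is
replaced by f(x).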

In (6.101.3) and (6.101.4), the $\Phi_i(x)$ are linear combinations with
constant coefficients of the $\phi_r(x)$. On multiplying (6.101.4) by w(x) and
integrating, one finds

$\mu_r = \sum_i \phi_r(x_i) \int_a^b \Phi_i(x)\,w(x)\,dx,$

and the integral on the right is expressible as a linear combination of the
$\mu_r$. Thus the choice

(7.1.11)  $\lambda_i = \int_a^b \Phi_i(x)\,w(x)\,dx$

satisfies (7.1.4) and determines the $\lambda_i$ uniquely. Hence in case n = m,
to use (7.1.4) to determine the $\lambda_i$ for the expression (7.1.3) is equivalent
to approximating the integral of fw by the integral of $\Phi w$, where $\Phi$ is the
interpolating function; when n > m, one can always determine the $\lambda_i$ by
(7.1.4), together with suitable further conditions so that the sum on the
right of (7.1.3), ignoring R, is the approximation given by integrating
$\Phi w$ with a function $\Phi$ determined by one of the methods of §6.1.

In case n < m, it may be possible to choose the points $x_i$ at which the
evaluations $f(x_i)$ are to be made so that the matrices $(\phi_r(x_i), \mu_r)$ and
$(\phi_r(x_i))$ have the same rank. In this event Eq. (7.1.4) is consistent and
has solutions $\lambda_i$; moreover, if the rank is n + 1, the solutions $\lambda_i$ are
unique.

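Consider the polynomial case with

(7.1.12)  $\phi_r(x) = x^r.$

If the points $x_i$ are all distinct, then the matrix $(\phi_r(x_i))$ has rank n + 1
for $m \ge n$. Hence Eq. (7.1.4) will be consistent if every square submatrix
of $(x_i^r, \mu_r)$ of order n + 2 is singular. This implies that any matrix
of the form

$\begin{pmatrix} 1 & 1 & \cdots & 1 & \mu_r \\ x_0 & x_1 & \cdots & x_n & \mu_{r+1} \\ \vdots & \vdots & & \vdots & \vdots \\ x_0^{n+1} & x_1^{n+1} & \cdots & x_n^{n+1} & \mu_{r+n+1} \end{pmatrix}$

must be singular, and therefore the matrix

$\begin{pmatrix} 1 & 1 & \cdots & 1 & \mu_0 & \mu_1 & \mu_2 & \cdots \\ x_0 & x_1 & \cdots & x_n & \mu_1 & \mu_2 & \mu_3 & \cdots \\ \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots & \\ x_0^{n+1} & x_1^{n+1} & \cdots & x_n^{n+1} & \mu_{n+1} & \mu_{n+2} & \mu_{n+3} & \cdots \end{pmatrix}$

must have rank n + 1. In particular if m = 2n + 1, and the $x_i$ are
taken to satisfy the equation

(7.1.13)  $\begin{vmatrix} 1 & \mu_0 & \mu_1 & \cdots & \mu_n \\ x & \mu_1 & \mu_2 & \cdots & \mu_{n+1} \\ \vdots & \vdots & \vdots & & \vdots \\ x^{n+1} & \mu_{n+1} & \mu_{n+2} & \cdots & \mu_{2n+1} \end{vmatrix} = 0,$

then if these $x_i$ are all distinct, the system (7.1.4) is consistent, and defines
a set of coefficients uniquely.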
With the $x_i$ as determined by (7.1.13), let

(7.1.14)  $\omega_n(x) = \prod_i (x - x_i) = x^{n+1} + \omega_{n,0}x^n + \cdots + \omega_{n,n}.$

Then $\omega_n(x)$ is equal to the determinant in (7.1.13) except for a constant
factor. Hence

(7.1.15)  $\mu_{n+1+t} + \omega_{n,0}\mu_{n+t} + \cdots + \omega_{n,n}\mu_t = 0.$

But by (7.1.1), (7.1.13), and (7.1.14), this implies that

(7.1.16)  $\int_a^b x^t\,\omega_n(x)\,w(x)\,dx = 0, \quad t = 0, 1, \ldots, n.$

Hence if p(x) is any polynomial of degree $\le n$,

$\int_a^b p(x)\,\omega_n(x)\,w(x)\,dx = 0.$

In particular if we consider the sequence of polynomials $\omega_0, \omega_1, \ldots, \omega_n$,
it follows from this equation that

(7.1.17)  $\int_a^b \omega_m(x)\,\omega_n(x)\,w(x)\,dx = 0, \quad n \ne m.$

The polynomials $\omega_m(x)$ are therefore said to be orthogonal on the interval
(a, b) with respect to the weight function w(x). The derivation of
Lanczos's theorem can be paraphrased to show that by defining formally
$\omega_{-1} = 1$,

(7.1.18)  $\omega_0(x) = x - \alpha_0 S_0^{-1},$
          $\omega_r(x) = (x - \alpha_r S_r^{-1})\,\omega_{r-1}(x) - S_r S_{r-1}^{-1}\,\omega_{r-2}(x), \quad r \ge 1,$

where

(7.1.19)  $\alpha_r = \int_a^b x\,\omega_{r-1}^2(x)\,w(x)\,dx,$
          $S_r = \int_a^b \omega_{r-1}^2(x)\,w(x)\,dx.$

Hence to obtain the polynomial $\omega_n(x)$, one can calculate recursively the
$\omega_r(x)$, r < n, instead of expanding the determinant (7.1.13).
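
The recursive construction of the orthogonal polynomials can be sketched in exact rational arithmetic. The example below assumes, purely for illustration, the weight w(x) = 1 on the interval (-1, 1), so that the moments are 2/(k + 1) for even k and zero for odd k; the function names are hypothetical, and the recursion used is the standard three-term recurrence for monic orthogonal polynomials, written in the indexing of this section (the polynomial with subscript r has degree r + 1). Polynomials are stored as coefficient lists, lowest power first.

```python
from fractions import Fraction

def moment(k):
    """mu_k for the assumed example: w(x) = 1 on (-1, 1)."""
    return Fraction(2, k + 1) if k % 2 == 0 else Fraction(0)

def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists (low to high)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integrate(p):
    """Integral of p(x) w(x) over the interval, expressed through the moments."""
    return sum(c * moment(k) for k, c in enumerate(p))

def omega_sequence(n):
    """Return [omega_0, ..., omega_n], starting from omega_{-1} = 1."""
    x = [Fraction(0), Fraction(1)]
    prev2, prev1, s_prev = None, [Fraction(1)], None
    result = []
    for r in range(n + 1):
        sq = poly_mul(prev1, prev1)
        s_r = integrate(sq)                    # S_r
        alpha_r = integrate(poly_mul(x, sq))   # alpha_r
        cur = poly_mul([-alpha_r / s_r, Fraction(1)], prev1)
        if prev2 is not None:
            ratio = s_r / s_prev
            for k, c in enumerate(prev2):
                cur[k] -= ratio * c
        result.append(cur)
        prev2, prev1, s_prev = prev1, cur, s_r
    return result

omegas = omega_sequence(2)
```

For this weight the polynomials obtained are the monic Legendre polynomials x, x^2 - 1/3, and x^3 - 3x/5, whose zeros are the abscissas of the corresponding Gaussian formulas.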

In this situation both the $x_i$ and the $\lambda_i$ were left completely arbitrary
and determined so as to take account of a maximal set of the $\phi_r$. Another
possible procedure would be to restrict the $x_i$ and $\lambda_i$ by a limited number
of conditions, and then impose as many of the conditions (7.1.4) as
possible. As an example of this, it is sometimes desirable to select the
$x_i$ so that the $\lambda_i$ are all equal:

$\lambda_0 = \lambda_1 = \cdots = \lambda_n = \lambda.$

There are then to be determined the n + 2 quantities $x_0, \ldots, x_n$ and $\lambda$,
and therefore in general n + 2 conditions (7.1.4) can be imposed. For
polynomials $\phi_r$ these have the form

$\lambda = \mu_0/(n + 1),$
$\sum_i x_i = (n + 1)\mu_1/\mu_0,$
$\cdots$
$\sum_i x_i^{n+1} = (n + 1)\mu_{n+1}/\mu_0.$

Hence the sums of the powers up to the (n + 1)st are known. From
these the elementary symmetric functions of the $x_i$ can be formed by
means of Eqs. (3.02.5), and hence the equation of degree n + 1 of which
the $x_i$ are the roots. Unfortunately the roots $x_i$ do not necessarily
always fall within the range of the integration, and when they do not, the
method is unusable.
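
The passage from power sums to the equation satisfied by the abscissas can be sketched as follows. The code assumes that the relations cited as Eqs. (3.02.5) are Newton's identities connecting power sums and elementary symmetric functions; the function name and the sample moments (w(x) = 1 on (-1, 1), three points) are illustrative assumptions only.

```python
from fractions import Fraction

def equal_weight_polynomial(moments):
    """From mu_0, ..., mu_{n+1} return (lambda, coeffs).

    coeffs lists, highest power first, the monic polynomial of degree n + 1
    whose roots are the abscissas of the equal-coefficient formula.
    """
    count = len(moments) - 1                      # number of points, n + 1
    lam = moments[0] / count
    # power sums p_k = sum_i x_i^k = (n + 1) mu_k / mu_0, k = 1, ..., n + 1
    p = [None] + [count * moments[k] / moments[0] for k in range(1, count + 1)]
    e = [Fraction(1)]                             # elementary symmetric functions
    for k in range(1, count + 1):
        # Newton's identity: k e_k = sum_{i=1}^{k} (-1)^(i-1) e_{k-i} p_i
        s = sum((-1) ** (i - 1) * e[k - i] * p[i] for i in range(1, k + 1))
        e.append(s / k)
    coeffs = [(-1) ** k * e[k] for k in range(count + 1)]
    return lam, coeffs

mus = [Fraction(2), Fraction(0), Fraction(2, 3), Fraction(0)]
lam, coeffs = equal_weight_polynomial(mus)
```

For these moments the polynomial is x^3 - x/2, with roots 0 and plus or minus the square root of 1/2, all inside the range of integration, and the common coefficient is 2/3.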
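7.2. Numerical Differentiation. Equation (6.101.7) can be written

(7.2.1)  $f(x) = \Phi(x) + \sum_i \Phi_i(x) \int_{x_i}^{x} g(x_i, s)\,L[f(s)]\,ds.$

Hence

$f'(x) = \Phi'(x) + \sum_i \Phi_i'(x) \int_{x_i}^{x} g(x_i, s)\,L[f(s)]\,ds + \sum_i \Phi_i(x)\,g(x_i, x)\,L[f(x)].$

But

$\sum_i \Phi_i(x)\,g(x_i, s) = g(x, s)$

and

g(x, x) = 0.

Hence

(7.2.2)  $f'(x) = \Phi'(x) + \sum_i \Phi_i'(x) \int_{x_i}^{x} g(x_i, s)\,L[f(s)]\,ds.$

Again

$f''(x) = \Phi''(x) + \sum_i \Phi_i''(x) \int_{x_i}^{x} g(x_i, s)\,L[f(s)]\,ds + \sum_i \Phi_i'(x)\,g(x_i, x)\,L[f(x)]$

and

$[\partial g(x, s)/\partial x]_{s = x} = 0,$

so that

(7.2.3)  $f''(x) = \Phi''(x) + \sum_i \Phi_i''(x) \int_{x_i}^{x} g(x_i, s)\,L[f(s)]\,ds.$

Similar relations hold for $f^{(\nu)}$, $\nu \le n$.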

For polynomial interpolation in terms of divided differences the
equation

(7.2.4)  $f(x) = f(x_0) + (x - x_0)f(x_0, x_1) + (x - x_0)(x - x_1)f(x_0, x_1, x_2)$
         $\qquad + \cdots + (x - x_0)\cdots(x - x_{n-1})f(x_0, x_1, \ldots, x_n)$
         $\qquad + (x - x_0)\cdots(x - x_n)f(x_0, x_1, \ldots, x_n, x)$

is exact, the last term representing an expression for the remainder. Let

$x_{(r)} = (x - x_0)\cdots(x - x_r).$

Then

(7.2.5)  $f'(x) = f(x_0, x_1) + x'_{(1)}f(x_0, x_1, x_2) + \cdots + x'_{(n-1)}f(x_0, \ldots, x_n) + R',$

where

(7.2.6)  $R' = x'_{(n)}f(x_0, x_1, \ldots, x_n, x) + x_{(n)}f(x_0, x_1, \ldots, x_n, x, x).$

When $x = x_i$, $x_{(n)} = 0$, and the second term drops out of the remainder.
Continuing, one finds

(7.2.7)  $f''(x) = 2f(x_0, x_1, x_2) + x''_{(2)}f(x_0, x_1, x_2, x_3) + \cdots + x''_{(n-1)}f(x_0, \ldots, x_n) + R'',$

(7.2.8)  $R'' = x''_{(n)}f(x_0, x_1, \ldots, x_n, x) + 2x'_{(n)}f(x_0, x_1, \ldots, x_n, x, x)$
         $\qquad + 2x_{(n)}f(x_0, x_1, \ldots, x_n, x, x, x).$

All divided differences which occur in these remainders can be expressed
in integral form as in §5.141.
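
A minimal sketch of formula (7.2.5), evaluated at the first fundamental point and with the remainder neglected, is given below. The function names are illustrative assumptions. At x = x_0 the derivative of the product (x - x_0)...(x - x_r) reduces to (x_0 - x_1)...(x_0 - x_r), since every other term of the differentiated product retains the factor (x - x_0).

```python
def divided_differences(xs, ys):
    """Return [f(x0), f(x0,x1), ..., f(x0,...,xn)], the top edge of the table."""
    coef = list(ys)
    n = len(xs)
    for order in range(1, n):
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef

def derivative_at_first_node(xs, ys):
    """Approximate f'(x0) by (7.2.5) with x = x0 and R' dropped."""
    coef = divided_differences(xs, ys)
    total, factor = 0.0, 1.0
    for r in range(1, len(xs)):
        total += factor * coef[r]     # factor = (x0 - x1)...(x0 - x_{r-1})
        factor *= xs[0] - xs[r]
    return total

xs = [0.0, 1.0, 2.0, 4.0]
ys = [x ** 3 for x in xs]
slope = derivative_at_first_node(xs, ys)
```

Because f here is a cubic and four points are used, the neglected remainder vanishes and the computed slope equals f'(0) = 0 up to rounding.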

7.3. Operational Methods. For polynomial interpolation with equally
spaced fundamental points, formulas for numerical differentiation and
integration can be derived by operational methods. The Taylor expansion
of an analytic function can be written in the form

$f(x_0 + uh) = f(x_0) + uh\,f'(x_0) + u^2h^2 f''(x_0)/2! + \cdots.$

Hence by introducing the differential operators

(7.3.1)  $D = d/dx, \quad \theta = hD$

and recalling the definitions of the displacement and difference operators,
we can write formally

$E^u = 1 + uhD + u^2h^2D^2/2! + \cdots = e^{u\theta},$

and therefore set

(7.3.2)  $e^\theta = E = 1 + \Delta = (1 - \nabla)^{-1},$

whence

(7.3.3)  $\theta = \log E = \log(1 + \Delta) = -\log(1 - \nabla).$

Hence

(7.3.4)  $\theta = \Delta - \Delta^2/2 + \Delta^3/3 - \Delta^4/4 + \cdots$
         $\quad = \nabla + \nabla^2/2 + \nabla^3/3 + \nabla^4/4 + \cdots.$

These expansions can be used to obtain the derivative at any point $x_0$ in
terms of forward or backward differences. To obtain $f'(x) = f'(x_0 + uh)$,
write

(7.3.5)  $E^u\theta = (1 + \Delta)^u \log(1 + \Delta) = \Delta + (2u - 1)\Delta^2/2$
         $\qquad + (3u^2 - 6u + 2)\Delta^3/6 + \cdots$

or the corresponding expression in terms of $\nabla$.

For the derivative in terms of central differences, we have

$\delta = E^{1/2} - E^{-1/2} = e^{\theta/2} - e^{-\theta/2} = 2\sinh(\theta/2),$
$\mu = (E^{1/2} + E^{-1/2})/2 = (e^{\theta/2} + e^{-\theta/2})/2 = \cosh(\theta/2).$

From this it follows that formally

$d\delta/d\theta = \mu = (1 + \delta^2/4)^{1/2}$

and therefore

(7.3.6)  $\theta = \int_0^\delta (1 + \tau^2/4)^{-1/2}\,d\tau = \delta - \frac{1^2}{3!\cdot 2^2}\delta^3 + \frac{1^2\cdot 3^2}{5!\cdot 2^4}\delta^5 - \cdots.$

This formula gives the value of $f'(x_0)$ in terms of the values of f at
points $x_{n/2}$. To obtain $f'(x_0)$ in terms of values of f at points $x_n$, proceed
as follows:

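It can be verified directly that $\theta/\mu$ satisfies the differential equation

(7.3.7)  $(1 + \delta^2/4)\,d(\theta/\mu)/d\delta + (\delta/4)(\theta/\mu) = 1.$

Also $\theta/\mu$ is an odd function of $\delta$. Hence assume the expansion

$\theta/\mu = \delta + c_3\delta^3 + c_5\delta^5 + \cdots,$

substitute into the differential equation, and equate coefficients of like
powers of $\delta$. By this means one finds that

(7.3.8)  $\theta = \mu\left[\delta - \frac{1^2}{3!}\delta^3 + \frac{1^2\cdot 2^2}{5!}\delta^5 - \cdots\right].$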
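
As an illustration, keeping only the first two terms of the central-difference series, so that the operator hD is replaced by the averaged first central difference minus one-sixth of the averaged third central difference, yields a five-point estimate of the derivative at the middle point. The sketch below assumes equally spaced ordinates; the function name and the test function are illustrative choices, not part of the text.

```python
def derivative_central(y_m2, y_m1, y_p1, y_p2, h):
    """Two-term central-difference estimate of f'(x0).

    The averaged first difference at x0 is (y_1 - y_{-1})/2 and the averaged
    third difference is (y_2 - 2 y_1 + 2 y_{-1} - y_{-2})/2.
    """
    mu_delta = (y_p1 - y_m1) / 2.0
    mu_delta3 = (y_p2 - 2.0 * y_p1 + 2.0 * y_m1 - y_m2) / 2.0
    return (mu_delta - mu_delta3 / 6.0) / h

f = lambda x: x ** 4
x0, h = 1.0, 0.5
estimate = derivative_central(f(x0 - 2 * h), f(x0 - h), f(x0 + h), f(x0 + 2 * h), h)
```

The combination is the familiar rule [8(y_1 - y_{-1}) - (y_2 - y_{-2})]/(12h), which is exact for polynomials through the fourth degree; here it reproduces f'(1) = 4.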

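To continue to higher derivatives, we have next that

$d(\theta^2)/d\delta = 2\theta\,d\theta/d\delta = 2\theta/\mu.$

Hence

$\theta^2 = 2\int_0^\delta (\theta/\mu)\,d\delta,$

or

(7.3.9)  $\theta^2 = 2[\delta^2/2 - \delta^4/24 + \delta^6/180 - \cdots].$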

It can now be shown inductively that 
and 



From this one can proceed sequentially to find $\theta^3/\mu$, $\theta^4$, $\theta^5/\mu$, . . . .

Some estimate of the error can be had by noting the magnitude of
the first neglected term. It is possible to obtain an exact expression for
the error in any particular case by a method that will be used for
numerical integration, now to be described.
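
The forward-difference expansion of the operator hD as the alternating series in powers of the forward difference, together with the rough error estimate by the first neglected term, can be sketched as follows. The function names and the sample cubic are assumptions made for illustration.

```python
def forward_differences(values):
    """Return [y0, Delta y0, Delta^2 y0, ...] from equally spaced ordinates."""
    diffs, row = [], list(values)
    while row:
        diffs.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return diffs

def derivative_forward(values, h, terms):
    """Estimate f'(x0) from the first `terms` terms of Delta - Delta^2/2 + ...

    Also returns the first neglected term (divided by h) as a rough error
    estimate, or None when no further difference is available.
    """
    d = forward_differences(values)
    series = [(-1) ** (k + 1) * d[k] / k for k in range(1, len(d))]
    estimate = sum(series[:terms]) / h
    neglected = series[terms] / h if terms < len(series) else None
    return estimate, neglected

h = 0.5
ys = [(1.0 + i * h) ** 3 for i in range(4)]   # f(x) = x^3 tabulated from x0 = 1
two_term, first_neglected = derivative_forward(ys, h, 2)
three_term, _ = derivative_forward(ys, h, 3)
```

For this cubic the three-term value is exactly f'(1) = 3, and the first neglected term of the two-term value (0.5) happens to equal its actual error, since all higher differences vanish; for a general function the neglected term is only an indication.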

Ordinarily in numerical integrations the integral is to be evaluated
between two of the interpolation points, say $x_i$ and $x_j$. At any rate it is
no restriction to assume that the lower limit is such a point. Let F(x)
be any function satisfying

(7.3.10)  $DF(x) = f(x).$

One wishes to evaluate $(E^u - E^i)F(x_0)$ where i is an integer. If i = 0,
we have

(7.3.11)  $\int_{x_0}^{x_0 + uh} f(x)\,dx = (E^u - 1)F(x_0)$
          $\quad = (u\theta + u^2\theta^2/2! + u^3\theta^3/3! + \cdots)F(x_0)$
          $\quad = uh(1 + u\theta/2! + u^2\theta^2/3! + \cdots)f(x_0).$

If the powers of $\theta$ are replaced by their expansions in terms of any of the
difference operators up to whatever power may be desired, the result is a
formula for numerical integration. If one so desires, he can replace all
difference operators retained in the formula by their expressions in terms
of the displacement operator E, so obtaining a formula directly in terms
of the $f(x_i)$.

7.31. The Trapezoidal Rule. Next to a Riemann sum this is the
simplest rule of all. It is given in (7.1.7), but will now be derived
operationally. In (7.3.11) set u = 1; we carry the expansion to the first
power only of the difference operator, and to this degree of approximation
$\theta = \Delta$. Hence (7.3.11) gives

(7.31.1)  $\int_{x_0}^{x_1} f(x)\,dx \approx h(1 + \Delta/2)f(x_0) = h(y_0 + y_1)/2.$

To evaluate the remainder one can proceed as follows: The formulas

(7.31.2)  $e^\theta = 1 + \theta + \theta^2 \int_0^1 \tau e^{(1-\tau)\theta}\,d\tau,$
          $e^\theta = 1 + \theta + \theta^2/2 + (\theta^3/2)\int_0^1 \tau^2 e^{(1-\tau)\theta}\,d\tau$

are exact and can be verified directly by integrating by parts. If $e^\theta$ is
replaced by E, and $e^\theta - 1$ by $\Delta$, one has by the second of these

$\Delta F(x_0) = h\left[1 + \theta/2 + (\theta^2/2)\int_0^1 \tau^2 E^{1-\tau}\,d\tau\right]f(x_0),$

and by the first therefore

$\Delta F(x_0) = h(1 + \Delta/2)f(x_0) - (h/2)\,\theta^2 \int_0^1 \tau(1 - \tau)E^{1-\tau}\,d\tau\,f(x_0).$

Thus the integral operator provides the desired remainder. This can be
transformed

$h\,\theta^2 \int_0^1 \tau(1 - \tau)E^{1-\tau}\,d\tau\,f(x_0) = h^3 \int_0^1 \tau(1 - \tau)\,f''[x_0 + (1 - \tau)h]\,d\tau$
$\quad = \int_{x_0}^{x_1} (x_1 - v)(v - x_0)\,f''(v)\,dv$

by introducing the new variable of integration

$v = x_0 + (1 - \tau)h = x_1 - \tau h.$

Hence we obtain finally the trapezoidal rule with remainder

(7.31.3)  $\int_{x_0}^{x_1} f(x)\,dx = \tfrac12 h(y_0 + y_1) - \tfrac12 \int_{x_0}^{x_1} (x_1 - v)(v - x_0)\,f''(v)\,dv.$

By the law of the mean, since

$\int_{x_0}^{x_1} (x_1 - v)(v - x_0)\,dv = (x_1 - x_0)^3/6 = h^3/6,$

we can write

(7.31.4)  $\int_{x_0}^{x_1} f(x)\,dx = \tfrac12 h(y_0 + y_1) - \tfrac{1}{12}h^3 f''(\xi), \quad x_0 < \xi < x_1.$

Upper and lower limits for the true value of the integral can be had by
introducing minimum and maximum values of f''. If f''(x) does not
change sign on the interval, the law of the mean can be invoked again to
give

$\int_{x_0}^{x_1} (x_1 - v)(v - x_0)\,f''(v)\,dv = (x_1 - \xi)(\xi - x_0)(y_1' - y_0').$

Since the quadratic factor cannot exceed $h^2/4$, it follows that, when f''
does not change sign, then for some positive $\epsilon < 1$ it is true that

(7.31.5)  $\int_{x_0}^{x_1} f(x)\,dx = \tfrac12 h(y_0 + y_1) - \tfrac18 \epsilon h^2(y_1' - y_0').$
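
The enclosure just described can be tried numerically. The sketch below assumes that the second derivative does not change sign on the interval, so that the true integral lies between the trapezoidal value and that value diminished by h^2 (y_1' - y_0')/8; the function name and the exponential test case are illustrative assumptions.

```python
import math

def trapezoid_enclosure(f, df, x0, x1):
    """Bounds for the integral of f on [x0, x1] when f'' keeps one sign.

    t is the trapezoidal value; the other bound is t - h^2 (f'(x1) - f'(x0)) / 8.
    """
    h = x1 - x0
    t = 0.5 * h * (f(x0) + f(x1))
    other = t - h * h * (df(x1) - df(x0)) / 8.0
    return min(t, other), max(t, other)

low, high = trapezoid_enclosure(math.exp, math.exp, 0.0, 0.5)
exact = math.exp(0.5) - 1.0
```

For the exponential on (0, 0.5) the second derivative is positive, the trapezoidal value is the upper bound, and the exact value falls between the two limits.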

7.32. The Maclaurin Quadrature Formula. The identities (7.31.2) are
special cases of the more general form

(7.32.1)  $e^{u\theta} = 1 + u\theta + \cdots + u^m\theta^m/m!$
          $\qquad + (u^{m+1}\theta^{m+1}/m!) \int_0^1 \tau^m e^{(1-\tau)u\theta}\,d\tau.$

On writing this with 2m + 1 in place of m, setting u = 1 and u = -1,
subtracting, and applying to $F(x_0)$, one obtains

(7.32.2)  $\int_{x_{-1}}^{x_1} f(x)\,dx = 2h[y_0 + h^2 y_0''/3! + \cdots + h^{2m} y_0^{(2m)}/(2m + 1)!] + R,$

where

(7.32.3)  $R = \frac{h^{2m+2}}{(2m+1)!} \int_0^1 \tau^{2m+1}\{f^{(2m+1)}[x_0 + (1 - \tau)h] - f^{(2m+1)}[x_0 - (1 - \tau)h]\}\,d\tau$
          $\quad = \frac{1}{(2m+1)!} \int_0^h (h - v)^{2m+1}[f^{(2m+1)}(x_0 + v) - f^{(2m+1)}(x_0 - v)]\,dv.$

This is the Maclaurin quadrature formula. For m = 0,

(7.32.4)  $\int_{x_{-1}}^{x_1} f(x)\,dx = 2hy_0 + \int_0^h (h - v)[f'(x_0 + v) - f'(x_0 - v)]\,dv.$

The integral on the right expresses the difference between the area under
the curve and that under the tangent to the curve at the point $(x_0, y_0)$.
7.33. Simpson's Rule. When m = 1, the formulas (7.32.2) and
(7.32.3) in operational form are

$E - E^{-1} = 2\theta + \tfrac13\theta^3 + (\theta^4/6) \int_0^1 \tau^3[E^{1-\tau} - E^{-(1-\tau)}]\,d\tau$

with operators applied to $F(x_0)$. One factor $\theta$ on the right applied to
$F(x_0)$ replaces it by $hf(x_0)$. Consider the term $\theta^2$. Since

$2 + \delta^2 = 2\cosh\theta = e^\theta + e^{-\theta}$

by (7.3.5), we can apply (7.32.1) to obtain

$\delta^2 = \theta^2 + \tfrac12\theta^3 \int_0^1 \tau^2[E^{1-\tau} - E^{-(1-\tau)}]\,d\tau.$

Hence

(7.33.1)  $E - E^{-1} = 2\theta + \tfrac13\theta\delta^2 - \tfrac16\theta^4 \int_0^1 \tau^2(1 - \tau)[E^{1-\tau} - E^{-(1-\tau)}]\,d\tau$
          $\quad = \tfrac13\theta(E + 4 + E^{-1}) - \tfrac16\theta^4 \int_0^1 \tau^2(1 - \tau)[E^{1-\tau} - E^{-(1-\tau)}]\,d\tau.$

This gives Simpson's rule with remainder when applied in the usual way
to $F(x_0)$. After changing the variable of integration on the right, one has

(7.33.2)  $\int_{x_{-1}}^{x_1} f(x)\,dx = \tfrac13 h(y_{-1} + 4y_0 + y_1)$
          $\qquad - \tfrac16 \int_0^h (h - v)^2 v\,[f'''(x_0 + v) - f'''(x_0 - v)]\,dv.$

It is customary to point out that, whereas Simpson's rule utilizes only
second differences, $\delta^2$, and should therefore have a vanishing remainder
when f is a quadratic polynomial, the remainder vanishes in fact even
when f is a cubic polynomial. This follows from the fact that for a
cubic polynomial f the third derivative is constant, and the integrand
therefore vanishes in the remainder term.

In case f''' is monotonic (but not necessarily constant), the integrands
which occur on the right in (7.33.2) and in (7.32.3) for m = 1 are both
positive or both negative. However, one integral appears as an additive,
and one as a subtractive, term. Hence in case f''' is monotonic, of
the two approximations $h(y_{-1} + 4y_0 + y_1)/3$ and $2h(y_0 + h^2 y_0''/6)$, one
is an overestimate and one an underestimate of the integral which appears
on the left.

Another "enclosure" theorem can be obtained when f''' is monotonic,
a theorem, that is, which provides an upper and a lower bound to the true
value of the integral. When f''' is monotonic, then the difference between
the derivatives which appears in the integral on the right of (7.33.2) is
bounded by 0 and $f'''(x_1) - f'''(x_{-1})$. Hence for some positive $\epsilon < 1$,

$\int_0^h (h - v)^2 v\,[f'''(x_0 + v) - f'''(x_0 - v)]\,dv = \epsilon(y_1''' - y_{-1}''') \int_0^h (h - v)^2 v\,dv$
$\quad = \epsilon h^4 (y_1''' - y_{-1}''')/12.$

Hence

(7.33.3)  $\int_{x_{-1}}^{x_1} f(x)\,dx = \tfrac13 h(y_{-1} + 4y_0 + y_1) - \tfrac{1}{72}\epsilon h^4 (y_1''' - y_{-1}''').$

The bounds are obtained by setting $\epsilon$ equal to 0 and 1.

7.34. Newton's Three-eighths Rule. To obtain a formula using third
differences requires the expansion of $e^\theta$ to the fourth power of $\theta$. In
order to gain the advantages of symmetry in the expressions, consider the
evaluation of $(E^{3/2} - E^{-3/2})F(x_0)$ in terms of $E^{\pm 3/2}f(x_0)$ and $E^{\pm 1/2}f(x_0)$,
assuming the fundamental abscissas to be $x_{\pm 1/2}$ and $x_{\pm 3/2}$. Write



-f 

and the corresponding expansions of $e^{-3\theta/2}$ with remainders $R_{-3/2}$ and
$R'_{-3/2}$. We find then that



(7.34.1) e 30 ' 2 - e~ 3e/ * = E - E~* = 30(1 

The expression $1 + 3\theta^2/8$ is required in terms of $\mu$, $\mu^3$, and remainders.
For this we have



- 6 M = 2 

and 

j0w + ^-w = 2 M = 2 + 

Hence 

(7.34.2) 1 + %0 2 = M 8 - ( + 

The required formula is obtained by combining (7.34.1) and (7.34.2).
Write this result:

(7.34.3)  $\int_{x_{-3/2}}^{x_{3/2}} f(x)\,dx = \tfrac38 h(y_{-3/2} + 3y_{-1/2} + 3y_{1/2} + y_{3/2}) + R,$

where R is obtained by combining the various remainder terms which
appear above. Consider the terms in $\theta R_{3/2}$ and $\theta R'_{3/2}$. When either of
these operators is applied to $F(x_0)$, they yield integrals with respect to $\tau$
between the limits 0 and 1 of a polynomial in $\tau$ multiplied by

$f^{iv}[x_0 + 3h(1 - \tau)/2].$

By introducing a new variable of integration

$v = x_0 + 3h(1 - \tau)/2,$

the integrals are taken from $x_0$ to $x_{3/2}$. After a little algebraic
manipulation, when the coefficient is included, they combine to give the integral



Xt 



- v)f' v (v)dv 

A similar manipulation of $\theta R_{-3/2}$ and $\theta R'_{-3/2}$ gives

V)*(XQ - t 



a result that can be written down immediately from symmetry. 

There remain the terms in $\theta R_{1/2}$ and $\theta R_{-1/2}$. For the first of these
introduce the variable of integration

$v = x_0 + h(1 - \tau)/2,$

the integral being taken then from $x_0$ to $x_{1/2}$. The result is



The same integral from $x_{-1/2}$ to $x_0$ results from the term in $\theta R_{-1/2}$. These
three integrals can be rearranged to give finally



(7.34.4) 



f * H 
J*- 



Equations (7.34.3) and (7.34.4) together give Newton's three-eighths rule 
with remainder. 

The remainder just given is identical in form with that which one
would obtain by using (7.1.6), and the polynomials in v which multiply
$f^{iv}(v)$ in the several integrands define the function M(v) which appears
in (7.1.6). More generally, the operational method is merely a device
that may be applied for calculating a quadrature formula and remainder
in the special case when the intervals are equal, the base functions are
polynomials, and w(x) = 1, whereas the method of §7.1 can be applied
in general.

It may be noted that M(v) is of fixed sign throughout the interval of
integration in (7.34.4), and hence the law of the mean can be applied as
in the previous cases.

7.35. Open Formulas. The trapezoidal rule, Simpson's rule, and the 
three-eighths rule are said to be of closed type, since the integration 
extends from the first to the last of the fundamental abscissas. A formula 
of open type is one which extends beyond. Such a formula is less exact 
but is often required, in particular in most methods of solving ordinary 
differential equations. In the most common methods for doing this, the 
dependent variable y is evaluated sequentially at successive points
$x_i, x_{i+1}, x_{i+2}, \ldots$. When $y_i$ has been evaluated, then one proceeds
to evaluate $y_{i+1}$, first approximately by an open integration of y',
thereafter by successive approximations by closed integrations, each closed
integration employing the currently available approximation to $y'_{i+1}$.
Under quite general conditions this process converges.

There are many possible formulas of open type, just as there are 
many of closed type. The three of the latter type which have already 
been given by no means exhaust the list. Of open formulas we shall give 
only one, which Milne uses in conjunction with Simpson's rule as the 
closed formula, for solving a differential equation. The formula in
question gives the integral of y = f(x) from $x_{-2}$ to $x_{+2}$ in terms of the three
middle ordinates $y_{-1}$, $y_0$, and $y_1$. Hence it is a quadratic formula.

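In (7.32.1) for m = 3, set u = 2 and u = -2 and subtract:

$E^2 - E^{-2} = \tfrac43\theta\left[3 + 2\theta^2 + 2\theta^3 \int_0^1 \tau^3(E^{2(1-\tau)} - E^{-2(1-\tau)})\,d\tau\right].$

Again in (7.32.1) for m = 2, set u = 1 and u = -1 and add:

$E + E^{-1} = 2 + \theta^2 + \tfrac12\theta^3 \int_0^1 \tau^2(E^{1-\tau} - E^{-(1-\tau)})\,d\tau.$

The left member of this equation is $\delta^2 + 2$; on eliminating $\theta^2$ from
the first and applying the operator to $F(x_0)$ as usual, we have

$\int_{x_{-2}}^{x_2} f(x)\,dx = \tfrac43 h(3 + 2\delta^2)y_0 + R,$

where

$R = \tfrac43 h^4 \left\{-\int_0^1 \tau^2[f'''(x_1 - \tau h) - f'''(x_{-1} + \tau h)]\,d\tau\right.$
$\qquad \left. + 2\int_0^1 \tau^3[f'''(x_2 - 2\tau h) - f'''(x_{-2} + 2\tau h)]\,d\tau\right\}.$

After a change of the variables of integration this can be written

(7.35.1)  $R = \tfrac16 \int_{x_0}^{x_2} (x_2 - v)^3 f'''(v)\,dv - \tfrac16 \int_{x_{-2}}^{x_0} (v - x_{-2})^3 f'''(v)\,dv$
          $\qquad - \tfrac43 h \int_{x_0}^{x_1} (x_1 - v)^2 f'''(v)\,dv + \tfrac43 h \int_{x_{-1}}^{x_0} (v - x_{-1})^2 f'''(v)\,dv.$

This is the remainder R in the required formula

(7.35.2)  $\int_{x_{-2}}^{x_2} f(x)\,dx = \tfrac43 h(2y_{-1} - y_0 + 2y_1) + R.$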
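
A minimal sketch of the open formula, with remainder neglected, follows. It uses the three middle ordinates with weights 4h/3 times (2, -1, 2) over the interval from x_{-2} to x_2; the function name and the test cubic are illustrative assumptions.

```python
def milne_open(y_m1, y0, y_p1, h):
    """Open-type estimate of the integral of f from x0 - 2h to x0 + 2h."""
    return 4.0 * h * (2.0 * y_m1 - y0 + 2.0 * y_p1) / 3.0

f = lambda x: x ** 3 - 2.0 * x ** 2 + 5.0
h = 1.0
approx = milne_open(f(-h), f(0.0), f(h), h)
exact = 20.0 - 32.0 / 3.0          # integral of f over (-2, 2)
```

Since the remainder involves only differences of the third derivative, the formula is exact for cubics, as this example confirms; in a predictor-corrector scheme of the kind described above, such a value would serve as the first approximation to be refined by the closed formula.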



7.36. The Euler-Maclaurin Summation Formula. If the polynomials
$\phi_\nu(\tau)$ satisfy



(7.36. 

then repeated integration by parts yields the formula 

(7.36.2) e 9 - 

If $e^\theta$ is replaced by E, and the operators applied to $f(x_0)$, one has an
"expansion" of $f(x_0 + h)$ in powers of h, in which however the coefficients
involve derivatives evaluated at both $x_0$ and $x_0 + h$. Thus the
generalization of Taylor's expansion of Hummel and Seebeck comes from choosing

$\phi_{n+m} = \tau^m(1 - \tau)^n/(n + m)!.$

In general, it is easily verified that for any set of polynomials satisfying
(7.36.1) if we write

(7.36.3)  $\phi_\nu(\tau) = a_0\tau^\nu/\nu! + a_1\tau^{\nu-1}/(\nu - 1)! + \cdots + a_\nu,$

then each $a_i$ is the same for any $\phi_\nu$ with $\nu \ge i$. Let

(7.36.4)  $a_i = B_i/i!.$

Then

$\phi_\nu(\tau) = \sum_{i=0}^{\nu} B_i\,\tau^{\nu-i}/[i!\,(\nu - i)!].$

This can be written symbolically in the form

(7.36.5)  $\phi_\nu(\tau) = (B + \tau)^\nu/\nu!$

if we understand that in the expansion $B^i$ is to be replaced by $B_i$. If we
now require that

(7.36.6)  $B_0 = 1, \quad (B + 1)^\nu = B^\nu, \quad \nu > 1,$

then we have a recursion that defines the $B_i$, and hence the $\phi_\nu(\tau)$, uniquely,
and (7.36.1) takes a particularly simple form:



n 



(7.36.7)

When this is applied to $F(x_0)$, the result expresses the integral of f(x) from
$x_0$ to $x_1$ in terms of f and its derivatives evaluated at $x_0$ and $x_1$:



n 



(7.36.8) 

and R can be evaluated by applying the integral operator in (7.36.7).
If corresponding expansions are written for the integral from $x_1$ to $x_2$,
. . . , $x_{m-1}$ to $x_m$, and the results added, then the derivatives drop out
everywhere except at $x_0$ and $x_m$:

(7.36.9)  $\int_{x_0}^{x_m} f(x)\,dx = h(y_0/2 + y_1 + \cdots + y_{m-1} + y_m/2)$
          $\qquad - \sum_{\nu=2}^{n} h^\nu B_\nu\,(y_m^{(\nu-1)} - y_0^{(\nu-1)})/\nu! + R'.$

This is the Euler-Maclaurin summation formula, and it is often used for
approximating the value of a sum $\sum y_i = \sum f(x_i)$ in case the function f(x)
is readily integrated.

The constants $B_\nu$ are the Bernoulli numbers, and the polynomials
$(B + \tau)^\nu$ the Bernoulli polynomials. It turns out that

$B_{2\nu+1} = 0, \quad \nu > 0.$

For if n is allowed to approach infinity in (7.36.7), the expansion becomes



e) - (e 9 - 
Considering $\theta$ a real number, we can write



i , V D fl / i 

= 



Since the right member of this identity is an even function of $\theta$, only
even powers can appear on the left, and this proves the assertion.
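
The symbolic recursion B_0 = 1, (B + 1)^nu = B^nu for nu > 1 can be carried out mechanically: expanding the left member by the binomial theorem and replacing each power B^i by B_i, the term in B_nu cancels and the relation determines B_{nu-1} from the earlier numbers. The following sketch, with a hypothetical function name, does this in exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(count):
    """Return [B_0, ..., B_{count-1}] from B_0 = 1 and (B + 1)^nu = B^nu, nu > 1."""
    b = [Fraction(1)]
    for nu in range(2, count + 1):
        # sum_{i=0}^{nu-1} C(nu, i) B_i = 0; the coefficient of B_{nu-1} is nu
        s = sum(comb(nu, i) * b[i] for i in range(nu - 1))
        b.append(-s / nu)
    return b

numbers = bernoulli_numbers(7)
```

The values obtained, 1, -1/2, 1/6, 0, -1/30, 0, 1/42, show the vanishing of the odd-numbered constants beyond the first.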

7.4. Bibliographic Notes. The developments in Eqs. (7.2.4) and
following and in (7.3.6) and following are based on Steffensen (1927).
An interesting and suggestive development of quadrature formulas based
on polynomial interpolation is to be found in Kowalewski (1932).

The selection of points so as to minimize the number of points leads
to the Gaussian quadrature formulas, and that for equalizing the
coefficients leads to the Chebyshev formulas. See Sard (1948b, 1949a, and
1949b) for other criteria for a "good" formula and (1951) for extension
to more than one variable. The introductory treatment here follows
Kneschke (1949a and 1949b) in the main.



CHAPTER 8 
THE MONTE CARLO METHOD 



8. The Monte Carlo Method 

In §1.6, it was pointed out that in many, if not most, computations
the occurrence of the maximum possible error may be extremely improbable
and that it may be sufficient for practical purposes to be able to say
that the probability is p that the error in the result will exceed some
quantity $\delta$, where p is small, perhaps one-tenth or one-hundredth of
1 per cent, and $\delta$ is within the limits of tolerance. It may well happen
that computational labor of astronomical proportions would be required
for a result which is certain to be in error less than $\delta$, if indeed such a
result is attainable at all, whereas with only moderate labor the probability
of an error greater than $\delta$ can be made extremely small. Thus
though the computation is strictly deterministic, it may be both possible
and advantageous to employ nondeterministic, i.e., statistical, methods
to appraise the result.

The result of the computation is therefore treated as an estimate 
rather than a true approximation. Actually physical measurements are 
often of this sort. A physical measurement may be the average of many 
measurements, all differing from one another though taken under condi- 
tions as nearly identical as it is possible to make them. Along with the 
mean, the experimenter will then compute the probable error, which is 
not the maximum error that could have been made, since that is poorly 
defined or entirely undefined. Instead, the probable error is the error 
which one expects will be exceeded half the time. In more precise 
terms this means that, if one follows the practice of asserting with respect 
to any given measured quantity that the mean of the measurements does 
not differ from the true value by more than the probable error, then he 
may expect to be wrong in the case of about half the quantities under 
consideration. But the point of primary interest here is the fact that, 
if the maximal error of measurement is not defined, then neither can 
the maximal propagated error be specified for any computation which 
makes use of these measurements as data. 

This being granted, it is reasonable to consider the feasibility of using
nondeterministic methods for the computations themselves. This would
mean obtaining an estimate of the desired quantity by means of some
random sampling process, rather than obtaining an approximation by a
rigorous computation. This is known as the Monte Carlo method. In a
few situations it is the only feasible method known for "solving" the
problem, though it is by no means a general method. A few examples
will be given to explain and illustrate it, but any reasonably complete
treatment would have to make rather extensive use of the theory of
probability.

8.1. Numerical Integration. Consider first the problem of evaluating
a multiple integral,

(8.1.1)  $\Phi = \int \phi(x)\,dv.$

The variable x is taken to represent a vector in the space of the
coordinates $\xi_1, \ldots, \xi_n$ (where we could have n = 1, in particular); dv
represents an element of volume in the n-dimensional space; and the integral
is to be taken between fixed and finite limits. The assumption of finiteness
for the range of integration is no restriction in itself, since this can
always be achieved by a transformation of variables if necessary. However,
it is assumed that $\phi$ is everywhere finite and bounded in this region.
Hence the integral represents a hypervolume in the space of n + 1
dimensions with coordinates $\xi_1, \ldots, \xi_n, \eta$; therefore we can introduce
scale factors and translations and assume that in the region

(8.1.2)  $0 \le \phi(x) \le 1$

and that the integration extends over the region

(8.1.3)  $0 \le \xi_i \le 1.$

It follows that, if points were drawn at random from the entire unit
(n + 1)-dimensional hypercube with uniform probability, then the
probability is $\Phi$ that any particular one of the randomly selected points
has a coordinate $\eta$ satisfying

(8.1.4)  $\eta \le \phi(x),$

if $x = (\xi_1, \ldots, \xi_n)$ represents the other n coordinates of this point.
Hence if one made random drawings of a large number of points N, testing
inequality (8.1.4) each time a drawing is made, and if the inequality is
satisfied for N' of these points, then N'/N provides an estimate of $\Phi$.
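
The counting procedure just described can be sketched in a few lines. The example is illustrative only: the integrand (the product of two coordinates, whose integral over the unit square is 1/4), the sample size, the seed, and the function name are all assumptions, and a library pseudo-random generator stands in for the random drawings.

```python
import random

def hit_or_miss(phi, n_dims, n_samples, rng):
    """Estimate the integral of phi over the unit hypercube as N'/N.

    Each trial draws n_dims coordinates and one further coordinate eta,
    and counts a success when eta <= phi(x); phi must lie between 0 and 1.
    """
    hits = 0
    for _ in range(n_samples):
        x = [rng.random() for _ in range(n_dims)]
        eta = rng.random()
        if eta <= phi(x):
            hits += 1
    return hits / n_samples

rng = random.Random(12345)   # fixed seed so that the run is repeatable
estimate = hit_or_miss(lambda x: x[0] * x[1], 2, 20000, rng)
```

With 20,000 trials the estimate can be expected to differ from 1/4 by a few thousandths; the precise figure depends on the seed.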

This is the essential idea underlying the Monte Carlo method of 
numerical integration. However, in any digital computation one cannot 
draw arbitrary points from the cube, but only points whose coordinates 
have a digital representation. 

Suppose it has been determined that for representing the coordinate $\xi_i$
in the base $\beta$ it is sufficient to use $\sigma_i$ places and that the computation of
$\phi(x)$ will then be accurate to $\tau$ places. Suppose, further, one has available
some process for drawing digits 0, 1, . . . , $\beta - 1$ at random with equal
probability. One therefore makes $\sigma_1 + \sigma_2 + \cdots + \sigma_n + \tau$ drawings.
The first $\sigma_1$ digits in order provide the representation of the coordinate $\xi_1$;
the next $\sigma_2$ provide the representation of the coordinate $\xi_2$, . . . ; the
last $\tau$ provide the representation of the coordinate $\eta$. With these
representations one tests the inequality (8.1.4) to decide whether the selected
point in (n + 1) space lies inside or outside the volume. There is a
question whether the equality sign should be allowed in (8.1.4), or only
the strict inequality. If the equalities do not arise in sufficient numbers
to make a significant difference, then it is immaterial whether these points
are counted as inside or outside or are neglected altogether. If they make
a significant contribution, the decision must be based upon a consideration
of the routine for computing $\phi$.

If one could really make random selections from all points $(\xi_1, \ldots,
\xi_n, \eta)$ in the unit hypercube, rather than from those points only whose
coordinates are digital numbers, and could obtain the strict mathematical
value of $\phi(x)$ for any point, one would be repeating the occurrences of an
event with two possible outcomes: "success" and "failure" with
probabilities $\Phi$ and $1 - \Phi$. By standard statistical formulas one can
determine the probability that in N trials the number of actual successes N'
will differ from $N\Phi$ by more than any given amount.

But since only points x are drawn for which the $\xi_i$ are digital, one is
at best not estimating the volume $\Phi$, but a slightly different volume $\Phi'$,
and the statistical formula gives the probability of deviations from $N\Phi'$,
rather than from $N\Phi$. The nature of the volume $\Phi'$ is best illustrated
for the case n = 2.

Each of the $\sigma_j$ digits of $\xi_j$ can have any one of $\beta$ possible values. Hence
there are $\beta^{\sigma_1 + \sigma_2}$ possible points x. Associated with each is a computed
$\phi^*(x)$, defined by the computational routine. This is a digital number
with $\tau$ places. For each x let $\eta'$ represent a quantity differing from $\phi^*(x)$
by not more than $\beta^{-\tau}/2$, and whose exact value depends upon the error
$\phi^*(x) - \phi(x)$ and upon the rule for including or excluding the equality
in (8.1.4). Then $\Phi'$ is equal to $\beta^{-\sigma_1 - \sigma_2}$ times the sum of the quantities $\eta'$
for all possible x. Hence the quantity $\Phi'$ being estimated is essentially
that approximation to $\Phi$ that would be obtained by employing a Riemann
sum of $\beta^{\sigma_1 + \sigma_2}$ terms for the integral.

The total error in the entire computation is therefore

(8.1.5)  $\Phi - N'/N = (\Phi - \Phi') + (\Phi' - N'/N).$

The second parentheses is the so-called sampling error, which is the
deviation of the estimate from the quantity being estimated. In the
assertion that the probability is p that this error does not exceed $\delta$, if $\delta$
is regarded as a function of N for fixed p, then $\delta$ is inversely proportional
to the square root of N. Thus to cut $\delta$ in one-half one must quadruple
the size N of the sample.
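
The dependence of the sampling error on N can be made concrete under the usual normal approximation to the binomial distribution, which is an assumption introduced here for illustration: the half-width of the interval for a chosen probability level is then a fixed multiple z of the standard deviation of N'/N. The function name, the value of the integral, and z = 1.96 are illustrative.

```python
import math

def sampling_halfwidth(phi, n, z):
    """Half-width z * sqrt(phi (1 - phi) / n) of the interval about the estimate.

    phi is the probability of success in a single trial, n the number of
    trials, and z the normal deviate for the chosen probability level.
    """
    return z * math.sqrt(phi * (1.0 - phi) / n)

delta_n = sampling_halfwidth(0.25, 10000, 1.96)
delta_4n = sampling_halfwidth(0.25, 40000, 1.96)
ratio = delta_n / delta_4n
```

The ratio of the two half-widths is 2, in agreement with the statement that halving the tolerance requires four times as many trials.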

The first parenthesis, $\Phi - \Phi'$, represents the usual computational
error, generated and residual. There is no initial error, since each $\phi^*(x)$
is to be computed for exactly that x that has been drawn by the random
process. For a given $\phi$, the error $\phi - \phi^*$ generated in computing any
$\phi^*(x)$ depends upon the values of the $\sigma_i$. Increasing these, which is to
say, employing more places in the computation, will decrease the generated
error, though at the expense of the additional labor involved in
carrying along the extra places. But increasing the $\sigma_i$ will also decrease
the residual error, which is the deviation of the Riemann sum from the
true value $\Phi$ of the integral. And the decrease in residual error comes
about without the need for actually computing additional terms in the
sum. Thus if the function $\phi$ is quite irregular, so that many subdivisions
of the range of integration would be required to make the residual error
sufficiently small, the direct computation of the Riemann sum might
require a prohibitively large number of terms, hence $\phi$ computed for a
prohibitively large number of x's, whereas in Monte Carlo computations
the fineness of the subdivision has no effect whatever upon the number
of values of x for which $\phi$ must be computed. This may be extremely
important when the space is of high dimensionality. Thus, if n = 6,
and each $\sigma_i = 10$, the Riemann sum contains $10^6$ terms; if $\sigma_i = 20$, it
contains $20^6$ terms.
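
The growth in the number of terms is easy to tabulate. The following sketch assumes only that the Riemann sum over n variables, with the i-th range divided into sigma_i parts, has one term per cell of the grid; the function name `riemann_terms` is hypothetical.

```python
import math

def riemann_terms(sigmas):
    """Number of terms in a product-grid Riemann sum: sigma_1 * ... * sigma_n."""
    return math.prod(sigmas)

print(riemann_terms([10] * 6))  # 1000000
print(riemann_terms([20] * 6))  # 64000000
```

Doubling the fineness in six dimensions multiplies the direct computation by 64, while the Monte Carlo sample size N is unaffected.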

There are known techniques for making Monte Carlo estimates of an 
element of an inverse matrix, and solutions of certain functional equa- 
tions, but it is not clear that the method is useful unless variables occur 
which are to be integrated out in the solution that is required. 

It should be mentioned in passing that (8.1.4) can be replaced by any
relation equivalent to it. Thus if $\phi$ is a square root, the relation $y^2 < \phi^2$
is equivalent and more easily examined.
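
As an illustration of such a replacement, the sketch below estimates the integral of the square root of x over the unit interval by counting points, testing y*y < x instead of y < sqrt(x). It assumes that (8.1.4) is a comparison of this general form and that x and y are drawn uniformly from the unit interval; the function name, the seed, and the sample size are illustrative choices, and Python's built-in generator stands in for a table of random digits.

```python
import random

def hit_or_miss_sqrt(n_samples, seed=0):
    """Estimate the integral of sqrt(x) on [0, 1] by counting points under the curve.

    The test y*y < x is equivalent to y < sqrt(x) for y >= 0,
    so no square root is ever computed.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = rng.random()
        y = rng.random()
        if y * y < x:
            hits += 1
    return hits / n_samples

estimate = hit_or_miss_sqrt(100_000)
print(estimate, abs(estimate - 2.0 / 3.0))
```

The exact value is 2/3, so the second printed number is the total error of the estimate for this sample.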

8.2. Random Sequences. To employ the Monte Carlo method one 
must be able somehow to obtain random sequences of digits, or at least 
sequences which resemble random sequences in all essential aspects. 
What constitutes an adequate "resemblance" is not altogether clear, but 
at least the digits used must be in roughly equal proportions, and no 
digit may show a marked tendency to follow any other particular digit. 
Printed tables of randomly selected decimal digits are available, and the 
Rand Corporation has prepared punched-card tables of random decimal 
digits. 

For high-speed machines neither printed tables nor punched cards 
are suitable sources. If one selects a $2\nu$-digit number, for $\nu$ sufficiently
large (say, 5 or more), squares it, extracts the middle $2\nu$ digits, and repeats,
one is bound eventually to return to a number already obtained. Thus the
process is cyclic. If the cycle is sufficiently long, then the extracted
digits form a sequence that resembles a random sequence, and the opera- 
tion is easily programed for a computing machine. Similar processes are 
multiplying by a fixed multiplier and extracting middle digits; and 
squaring and reducing modulo some prime. 
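
The squaring process can be sketched as follows. This is a minimal illustration, assuming decimal digits and that the square is padded with leading zeros to 4*nu digits before the middle 2*nu digits are taken; the names `middle_square_step` and `cycle_info` are hypothetical, and the small value nu = 2 is used only to keep the example short.

```python
def middle_square_step(value, nu):
    """Square a 2*nu-digit number and return the middle 2*nu digits of the square."""
    width = 2 * nu
    squared = str(value * value).zfill(2 * width)  # pad the square to 4*nu digits
    return int(squared[nu:nu + width])             # drop nu digits from each end

def cycle_info(seed, nu):
    """Return (steps taken before the cycle is entered, length of the cycle)."""
    first_seen = {}
    value, step = seed, 0
    while value not in first_seen:
        first_seen[value] = step
        value = middle_square_step(value, nu)
        step += 1
    return first_seen[value], step - first_seen[value]

print(middle_square_step(1234, 2))  # 1234**2 = 01522756, middle digits 5227
print(cycle_info(1234, 2))
```

Because there are only finitely many 2*nu-digit numbers, the loop in `cycle_info` always terminates; how long the cycle is depends strongly on the starting number.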

8.3. Bibliographic Notes. The Monte Carlo method achieved its 
first popularity among the atomic-energy laboratories, following some 
successes in its use by von Neumann and Ulam. For general discus- 
sions see Metropolis and Ulam (1949) and proceedings of the Monte 
Carlo Symposium edited by Householder, Forsythe, and Germond (1951). 
More recently a series of papers and reports have come out from the 
Institute for Numerical Analysis: Kac and Donsker (1950), Kac (1951), 
Wasow (1950, 1951a, and 1951b), Fortet (1952a and 1952b), Curtiss
(1952), Forsythe and Liebler (1950, 1951), and Cutkosky (1951). See 
also the proceedings of the several Endicott symposia, and the Quarterly 
Progress Reports of the National Applied Mathematics Laboratories of 
the National Bureau of Standards. 

For problems other than numerical quadrature, the method has been 
used for matrix inversion, for solving functional equations of various 
types, but more especially for problems associated with physical processes 
that are essentially stochastic in character. In this last connection see 
the proceedings of the Monte Carlo Symposium, and also Nelson (1949) 
and Kahn (1949, 1950). On integral equations, which specialize directly 
to linear algebraic systems, see Albert (1951-1952) and Nygaard (1952), 
in addition to references already mentioned. It is this author's opinion 
that the method has proved and will prove most useful for the intrinsically 
stochastic physical problems. 
PROBLEMS



CHAPTER 1 

1. Suppose that a, b, c, and x are digital numbers and that |ax + b| < |c|. Assume
a machine forming pseudoproducts and pseudoquotients with a given maximum error.
For calculating

$y = (ax + b)/c$:

(a) If |a| < |c| and |b| < |c|, what routine is optimal, and what is the error?
(b) If |a| > |c| and |x| > |c|, what error may occur?
2. If 

n 



y = cnx-\ a ", x ~ 



$a_i$ and $x$ digital, describe a routine producing only digital intermediate quantities and
obtain limits for the final error. 

3. A table of values of f(x) is to be prepared at equally spaced values of x with
values of $\Delta f$ given to facilitate interpolation. Should one give

$\Delta f_i^* = f^*(x_{i+1}) - f^*(x_i)$
or
$\Delta^* f_i = [f(x_{i+1}) - f(x_i)]^*$?

That is, should one give the rounded difference of the f's or the difference of the
rounded f's?

4. Find the error in the evaluation of 



if the computation makes use of the routine described above for square roots. 

5. Obtain formulas for the errors $\Delta y$ and relative errors $\Delta y/\Delta x$ due to errors $\Delta x$ for

(a) y = sin x,
(b) y = tan x,
(c) y = sec x,
(d) y = exp (ax),
(e) y = log x.

6. Obtain a formula for the error $\Delta x$ in the solution of the quadratic equation
$ax^2 + bx + c = 0$ if the coefficients may be in error by amounts $\Delta a$, $\Delta b$, $\Delta c$.

7. If $f = 1 - x^2/a$, the iteration



converges to the square root of a. Devise a routine based upon this iteration, assum- 
ing a machine with the same characteristics as described in 1.5, and analyze the 
errors.
CHAPTER 2

1. Solve in four distinct ways, using at least one direct and at least one iterative 
method: 

3.2x - 2.0y + 3.9z = 13.0,
2.1x + 5.1y - 2.9z = 8.6,
5.9x + 3.0y + 2.2z = 6.9.

2. If a, b, c, and d are arbitrary vectors in a plane, show that 

[b, c][a, d] + [c, a][b, d] + [a, b][c, d] = 0,

and write the corresponding determinantal identity. 

3. If a, . . . , e are vectors in 3 space, show that 

[b, c, e][a, d, e] + [c, a, e][b, d, e] + [a, b, e][c, d, e] = 0.

4. If $a_1, \ldots, a_4$ are linearly independent, then the vectors $a^i$ satisfying

$a^i \cdot a_j = \delta_{ij}$
are said to form a set reciprocal to the initial one. Show that

[ai, a 2 , a 3 , a 4 ]:[ei, e 2 , e 3 , e 4 ] - [e t , e 2 , e 3 , e 4 ]:[a l , a 2 , a 8 , a 4 ] 

= [ai, a*, a, a 4 ]:[ei, e s , as, a 4 ] 

and write the corresponding determinantal identities. 

5. If the $e_i$ form an arbitrary set of (linearly independent) reference vectors, let
$e^i$ represent the reciprocal set. Show that, if G is the matrix of $e_i \cdot e_j$, then $G^{-1}$ is the
matrix of $e^i \cdot e^j$. Also if $a'$ is the covariant representation of a in the system $e_i$, then
it is the contravariant representation in the system $e^i$, and conversely.

6. With the system of equations Ax = y of Prob. 1, form the equivalent system
$A^T A x = A^T y$. Solve using (i) Seidel iterations, (ii) relaxations, (iii) triangular
factorizations, and (iv) the method of Stiefel and Hestenes. 

7. Suppose it is required to evaluate $a^T x$, where a is a known vector and x satisfies
Ax = y. Show that in the factorization of the bordered matrix



(4 o) - "*'. 



where L' is unit lower triangular and W upper triangular, $a^T x$ is the element in the
lower right-hand corner of W. 

8. In the process of making a triangular factorization of a matrix A, certain quanti- 
ties may vanish and necessitate a reordering of rows or columns, or both, in order to 
proceed. Show that the process will go through without such rearrangements if and 
only if all the following determinants are non-null: 



CHAPTER 3 

1. If f(x) is of degree r < n, and



n 
has no repeated factors, then



f(\ 


1 ... l 




1 ... l 

I/o ... Xft 


f(x) 

g(x) ~ 


T n """^ ^ n ""^ 

f(Xo) fM 




x; ... x; 




X "" XQ X *"" Xn 







2. From Eqs. (3.02.9) express & as a determinant in the ^ and 0y as a determinant 
in the Si. 

3. Obtain equations similar to (3.02.5) and (3.02.9) relating the Si and the s/. 
From these obtain determinantal expressions for members of each set in terms of those 
of the others. 

4. If the Xi are distinct, show that 

1 1 



1 1 



Generalize for $x_1, x_2, \ldots, x_n$.

5. Evaluate $\pi$ as the smallest root of $x^{-1} \sin x = 0$ by applying (i) Bernoulli's
method and (ii) Graeffe's method, either to the equation itself or to a suitable
transform.

6. Accelerate the convergence of the Bernoulli sequence for the preceding problem
by applying the $\delta^2$ process.

7. Form a third-order iteration for x* + x 1 =0. 

8. Form a second-order polynomial iteration for the same equation. 
CHAPTER 4

1. Find the characteristic equation of the matrix 




2. Evaluate the largest proper value(s) of the above matrix by iteration.

3. Apply Lanczos's method (4.23) to obtain the proper values and proper vectors 
of the same matrix. 

4. Diagonalize by the method of 4.115 




5. Obtain the triple-diagonal form (4.24) of this matrix. 

8. The largest proper value of a certain matrix A of order n is to be evaluated by
iteration. Though the sequence $(A^2)^k x$ will give more rapid convergence than the
sequence $A^k x$, the formation of $A^2$ requires $n^3$ preliminary multiplications. Assuming
p iterations of A would be required, for what minimal value of p would it be more
efficient first to obtain



CHAPTER 5 

1. If g(x) is of degree n - 1 or less, show that



2. Show that $\sum x_i^n/\omega'(x_i) = 1$.

3. Show that Zrr^'^O /'(*<) = n( n + 

4. For any integer v > show that 



5. For $x = \cos \phi$ the functions

$U_{n-1}(x) = 2^{1-n} \sin n\phi \csc \phi$

are called Chebyshev polynomials of the second kind. Show that they are poly- 
nomials, obtain a recursion, and determine their zeros. 

6. The values of $f(x_i)$ and their differences are to be tabulated, with the $x_i$ equally
spaced, but an erroneous value $f(x_0) + \epsilon$ is entered in place of $f(x_0)$. Show the effect
of this error on the values of the successive differences.

7. The trigonometric functions are known exactly for certain values of the argument:
0°, 30°, 45°, . . . . Other values of the sine are to be obtained from these
by interpolation. Use an error formula to ascertain how many figures are reliable 
if the interpolating polynomial is quadratic; if it is cubic. 

8. Using the Chebyshev points, form the cubic interpolation polynomial for interpo- 
lating values of the sine over one quadrant. 

CHAPTER 6 

1. Taking $x_i = i$, use the method of 6.111 to construct the polynomials $\tau_r(x)$,
r = 0, 1, . . . , 6, orthogonal on the points -3, -2, . . . , +3, with W = I.

2. Experimentally measured values $y_i$ of $f(x_i)$ are given at points $x_i = i$. Values
$f'(x_i)$ of the derivative are desired. A standard method is based upon finding the
polynomial of some degree giving the best least-squares fit and differentiating. The
result depends upon the number of points used and upon the degree of the polynomial.
As an example, obtain formulas for $f'(0)$ in terms of $y_{-3}, y_{-2}, \ldots, y_3$.

3. An approximation of the form (6.0.1) to f(x) is required giving the best
least-squares fit to the data, subject to the restriction that the vector c of the coefficients $\gamma_i$
is constrained to satisfy

Bc = z

exactly (neglecting rounding errors). Show that, with the auxiliary vector w, c is
determined by the system

$F^T F c + B^T w = F^T y$,
Bc = z.

4. Obtain the expansion (6.2.2) for 

/(x) - cos (2 + x)*/&. 
5. Find the best linear polynomial and the best quadratic polynomial, in the sense
of least squares, for fitting the data 

x     0     2     6    10    15    18    21
y  14.0  13.0  10.7   8.0   5.0   2.9   1.0

and find the residuals for each. 

6. If f(x) is odd with the period $2\pi$, find the approximating trigonometric function
given

x  π/6  π/3  π/2  2π/3  5π/6  π

y  2.5  4    4.5  4     2.5   0.

7. If $xu \equiv \log (1 + x)$, then $e^{xu} = 1 + x$. Consider this as an equation in u for
fixed x. Then the equation

$f(x) \equiv (1 + x - e^{xu})/x = 0$

can be solved for u by the method of 3.2, yielding a sequence of rational fractions in 
x approximating log (1 + x). Obtain the fifth term in this sequence. For what 
values of x does the sequence converge? 

CHAPTER 7 

1. Apply the Euler-Maclaurin formula (7.36) to show that, if p is a positive integer, 

$1^p + 2^p + \cdots + n^p = n^{p+1}/(p + 1)$

B v p(p - i) . . . ( p - y 



2. Give a direct derivation of the recursion defined by Eqs. (7.1.18) and (7.1.19). 

3. For b = -a = 1, w(x) = 1, calculate $\omega_0$, $\omega_1$, $\omega_2$, $\omega_3$, and the $\lambda$'s associated with
each.

4. Do likewise with a = 0, b = ∞, and $w(x) = e^{-x}$.

5. If $x_i = x_0 + ih$, i = -3, -2, . . . , +3, obtain a formula of the form



f X 

JX- 



X *f(x)dx = Xo/o + X 2 (/_ 2 +/,) + X 3 (/- 3 + 63) + R, 



with R vanishing for polynomials up to a degree as high as possible, and find R in 
general. 

6. By the method outlined in 7.3, obtain explicit expansions in terms of central 
differences for derivatives up to the fifth of f(x) at $x = x_0$.

7. Let



f 

Ja 



R 



represent the result of applying a numerical quadrature formula based on equally 
spaced abscissas (e.g., Simpson's rule) to the evaluation of the integral of a particular 
function f(x) between fixed limits. Show that I is an even function of h, expressible
in the form 



where $I_0$ is the exact value of the integral. Hence derive a formula for an improved
approximation to $I_0$, given I(h) and I(h/2).
8. Obtain $\int (1 + x)^{1/2}\,dx$ numerically using Gauss's method with a polynomial

of third degree, and compare the result with the true value. 

9. Use Simpson's rule with four subintervals to evaluate

$\int_2^3 dx/\log x$.

CHAPTER 8 

1. Obtain a Monte Carlo estimate of the value of the integral in Prob. 8, Chap. 7. 
(Note that $y > (1 + x)^{1/2}$ is equivalent to $y^2 > 1 + x$.)